# --- Description --- #
## Preparation and analysis of the top 1500 Steam sales in 2024 (January to August) ##

We have a dataset from Kaggle (put link here?) that contains the records of the top 1500 games (by total revenue) sold on Steam from January to August 2024.

### Preparing SQL for use with Jupyter Notebook ###

We will be using SQLite for this project because it is very quick and easy to get working with Jupyter Notebook. The trade-off is that it lacks many advanced features of other SQL variants (such as stored procedures, user-defined functions defined in SQL itself, some pre-defined statistical functions, etc.). There is an accompanying SQL file in the same folder as this notebook (same filename, just with a .sql extension instead of .ipynb) that does the same analysis as here but uses MySQL.

Also note that the SQL cell magic (%%sql) causes Jupyter to interpret everything in the code block as SQL, so comments will generally not appear directly in these blocks. Brief descriptions of what each block does are printed *<b>like this</b>* above them.

Let's first load SQL externally and initialize the database in the environment:

In [1]:
%load_ext sql
%sql sqlite:///steam_revenue.db

### Load the data as an SQL table ###

A combination of pandas and sqlite3 is used to load the CSV and write it into the database.

In [2]:
import pandas as pd
import sqlite3

# Load the CSV using pandas
df = pd.read_csv('Steam_2024_bestRevenue_1500.csv')
# Connect to SQLite
conn = sqlite3.connect('steam_revenue.db')
# Load the DataFrame into a SQL table in the SQLite database
df.to_sql('steam_revenue', conn, if_exists='replace', index=False)
# Close the connection (since it's now loaded)
conn.close()

And let's confirm that the table has been properly loaded.

In [3]:
%%sql
SELECT * 
    FROM steam_revenue
    LIMIT 5
;

name,releaseDate,copiesSold,price,revenue,avgPlaytime,reviewScore,publisherClass,publishers,developers,steamId
WWE 2K24,07-03-2024,165301,99.99,8055097.0,42.36514031444467,71,AAA,2K,Visual Concepts,2315690
EARTH DEFENSE FORCE 6,25-07-2024,159806,59.99,7882151.0,29.65106126155342,57,Indie,D3PUBLISHER,SANDLOT,2291060
Sins of a Solar Empire II,15-08-2024,214192,49.99,7815247.0,12.45259326556514,88,Indie,Stardock Entertainment,"Ironclad Games Corporation,Stardock Entertainment",1575940
Legend of Mortal,14-06-2024,440998,19.99,7756399.0,24.79781729089117,76,Indie,"Paras Games,Obb Studio Inc.",Obb Studio Inc.,1859910
Shin Megami Tensei V: Vengeance,13-06-2024,141306,59.99,7629252.0,34.25849627863547,96,AA,SEGA,ATLUS,1875830


We'll also modify SqlMagic.displaylimit to None so that all rows get printed when asked.

In [121]:
%config SqlMagic.displaylimit = None

The data was successfully loaded, so let's proceed with the analysis.

# --- Pre-processing --- #

The table contains 4 string columns (name, publisherClass, publishers, developers), 1 date column (releaseDate), 3 integer columns (copiesSold, reviewScore, and steamId), and 3 float columns (price, revenue, and avgPlaytime), for 11 columns in total.
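
The column breakdown above can be checked against the database itself. The sketch below is a minimal illustration, assuming a plain `sqlite3` connection rather than the `%%sql` magic, and using an empty in-memory table whose declared types are an assumption (the types pandas actually writes may differ; in particular, SQLite has no dedicated date type, so releaseDate would be stored as text).

```python
import sqlite3

# In-memory stand-in for steam_revenue.db; the declared types are assumptions
# made for illustration, not read from the real database.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE steam_revenue (
        name TEXT, releaseDate TEXT, copiesSold INTEGER, price REAL,
        revenue REAL, avgPlaytime REAL, reviewScore INTEGER,
        publisherClass TEXT, publishers TEXT, developers TEXT, steamId INTEGER
    )
    """
)

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
columns = conn.execute("PRAGMA table_info(steam_revenue)").fetchall()
declared_types = {row[1]: row[2] for row in columns}

for col_name, col_type in declared_types.items():
    print(f"{col_name:15s} {col_type}")

conn.close()
```

Pointing `sqlite3.connect` at 'steam_revenue.db' instead of `":memory:"` (and skipping the `CREATE TABLE`) would list the types of the real table.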

Let's look at the top 10 and bottom 10 earning games to see if anything immediately stands out.

###### Show top-10 games by revenue

In [118]:
%%sql
SELECT *
    FROM steam_revenue
    ORDER BY revenue DESC
    LIMIT 10
;

name,releaseDate,copiesSold,price,revenue,avgPlaytime,reviewScore,publisherClass,publishers,developers,steamId
Black Myth: Wukong,19-08-2024,15517278,60.0,837793356.0,20.1,96.0,AAA,Game Science,Game Science,2358720
HELLDIVERS™ 2,08-02-2024,11905198,40.0,435635596.0,39.2,71.0,AAA,PlayStation Publishing LLC,Arrowhead Game Studios,553850
Palworld,18-01-2024,16704850,30.0,392328553.0,41.8,94.0,AA,Pocketpair,Pocketpair,1623730
Sons Of The Forest,22-02-2024,8693478,30.0,217017892.0,17.3,86.0,AA,Newnight,Endnight Games Ltd,1326470
Dragon's Dogma 2,21-03-2024,1785028,70.0,111478291.0,31.7,,AAA,"CAPCOM Co., Ltd.","CAPCOM Co., Ltd.",2054970
The First Descendant,30-06-2024,4043850,0.0,102244808.0,49.9,55.0,AA,NEXON,"NEXON Games Co., Ltd.",2074920
Last Epoch,21-02-2024,3300623,35.0,97723674.0,52.0,86.0,AA,Eleventh Hour Games,Eleventh Hour Games,899770
7 Days to Die,25-07-2024,9877443,45.0,89781931.0,85.9,89.0,AA,The Fun Pimps Entertainment LLC,The Fun Pimps,251570
V Rising,08-05-2024,4784609,35.0,83614738.0,32.7,,AA,Stunlock Studios,Stunlock Studios,1604030
Manor Lords,26-04-2024,2294915,40.0,63098408.0,16.7,88.0,AA,Hooded Horse,Slavic Magic,1363080


###### Show bottom-10 games by revenue

In [119]:
%%sql
SELECT
    *
FROM
    steam_revenue
ORDER BY
    revenue
LIMIT 10
;

name,releaseDate,copiesSold,price,revenue,avgPlaytime,reviewScore,publisherClass,publishers,developers,steamId
Memories Off #5 Togireta Film,31-01-2024,1778,15.0,20674.0,7.3,91,AA,"Spike Chunsoft Co., Ltd.",MAGES. Inc.,2184570
Claw Machine Sim,28-03-2024,3896,7.0,20723.0,2.9,94,Indie,Unechte Sachen,Unechte Sachen,2456120
DYSCHRONIA: Chronos Alternate - Dual Edition,27-03-2024,725,35.0,20922.0,6.3,92,Indie,IzanagiGames,"IzanagiGames,MyDearest Inc.",2023920
Megacopter: Blades of the Goddess,21-06-2024,1684,16.0,20946.0,3.6,95,Indie,Pizza Bear Games,Pizza Bear Games,1228360
Champion Shift,21-06-2024,4526,7.0,20955.0,6.6,79,Indie,SRG Studios,SRG Studios,2391900
Orc Covenant: Gay Bara Orc Visual Novel,21-08-2024,864,30.0,21022.0,3.9,75,Indie,Y Press Games,Y Press Games,2243570
Lucky Mark,19-04-2024,2505,10.0,21066.0,9.8,58,Indie,Super Alex,Super Alex,2450720
Dungeon Looter,13-05-2024,3205,12.0,21067.0,6.3,78,Indie,Wappen Games,Wappen Games,1228320
The Wandering Corinne,18-07-2024,2891,9.0,21082.0,0.8,54,Indie,"Mango Party,Mango Party News",ankoku marimokan,2311840
Deathwish Enforcers Special Edition,14-02-2024,1489,20.0,21086.0,1.9,87,Indie,Monster Bath Games Inc.,Monster Bath Games Inc.,2683030


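There are a few peculiarities to address from the top and bottom 10 (note that the two tables above come from a later re-run of those cells, so they already show the cleaned values; the raw values are visible in the 5-row preview further up):

1) There's one game ('The First Descendant', top 10) that has a price of 0 (and there are likely others as well)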

2) avgPlaytime has very high decimal precision for some reason.

3) The prices are listed as \\$(n).99 instead of just \\$(n+1).

4) Some games have a review score of 0

We can address each of these:

1) Prices of 0 should be kept as 0 (some economic models for video games can be to keep the game itself free with additional purchases possible - after all, one such game is in the top 10 for 2024!)

2) We can take the precision just out to the first decimal point (making the uncertainty 0.1 hours, or 6 minutes, which really shouldn't make a difference)

3) For prices ending in .99, we can just round those up to the next dollar amount - but we'll keep other values as their original decimal amount!

4) It's not immediately clear why some games have a review score of 0. A recent release date doesn't seem to explain it: some games in the top 10 were released in July / August and have defined scores, whereas the 5th highest-in-revenue game ('Dragon\'s Dogma 2') was released in March and yet has a score of 0. We'll redefine review scores of 0 to be Null.
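
Before running the updates, it can help to count how many rows each rule would touch. The sketch below is a dry run on a made-up four-row table (the game names and values are hypothetical), assuming a plain `sqlite3` connection. It uses the same `LIKE '%.99'` test as the price update, which relies on SQLite converting the numeric price to text for the comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE steam_revenue "
    "(name TEXT, price REAL, avgPlaytime REAL, reviewScore INTEGER)"
)
# Hypothetical rows covering each peculiarity listed above.
conn.executemany(
    "INSERT INTO steam_revenue VALUES (?, ?, ?, ?)",
    [
        ("Game A", 59.99, 29.65106126155342, 57),  # .99 price, long decimal
        ("Game B", 0.0, 49.9, 55),                 # free-to-play
        ("Game C", 69.99, 31.7, 0),                # .99 price, score of 0
        ("Game D", 12.5, 6.3, 78),                 # untouched by every rule
    ],
)

# Count the rows each cleaning rule would affect, without modifying anything.
free_games, dot99_prices, zero_scores = conn.execute(
    """
    SELECT
        SUM(CASE WHEN price = 0 THEN 1 ELSE 0 END),
        SUM(CASE WHEN price LIKE '%.99' THEN 1 ELSE 0 END),
        SUM(CASE WHEN reviewScore = 0 THEN 1 ELSE 0 END)
    FROM steam_revenue
    """
).fetchone()

print(free_games, dot99_prices, zero_scores)
conn.close()
```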

###### Modifying prices with \*.99 to (\*+1)

In [5]:
%%sql
UPDATE 
    steam_revenue
SET 
    price = CAST(price AS INTEGER) + 1
WHERE 
    (price LIKE '%.99')
;

###### Reduce precision of avgPlaytime to single decimal

In [6]:
%%sql
UPDATE
    steam_revenue
SET
    avgPlaytime = ROUND(avgPlaytime,1)
;

###### Changing reviewScore = 0 -> Null

In [7]:
%%sql
UPDATE
    steam_revenue
SET
    reviewScore = Null
WHERE
    reviewScore = 0
;

Let's next see how granular the publisherClass, publishers, and developers columns are:

###### Counting the number of publisher classes and their class percentage

In [179]:
%%sql
WITH pub_class AS (
    SELECT 
        publisherClass,
        COUNT(*) as count_
    FROM
        steam_revenue
    GROUP BY
        publisherClass
),
total_games AS (
    SELECT
        COUNT(*) AS num_games
    FROM
        steam_revenue
)
SELECT
    pc.publisherClass,
    pc.count_,
    ROUND(pc.count_ * 100.0 / tg.num_games, 2) AS games_percentage
FROM
    pub_class pc,
    total_games tg
GROUP BY
    publisherClass
ORDER BY
    pc.count_ DESC
;

publisherClass,count_,games_percentage
Indie,1301,86.79
AA,146,9.74
AAA,52,3.47


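There's only one game classified under 'Hobbyist' (which may not be that different from other Indie games), so we'll drop that one row, reducing us from 1500 to 1499 rows. (The output above comes from a re-run after the deletion, which is why 'Hobbyist' no longer appears and the counts sum to 1499.)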

###### Reducing the number of publisher classes from 4 (AAA [52], AA [146], Indie [1301], Hobbyist [1]) to 3 by removing the single 'Hobbyist' entry

In [9]:
%%sql
DELETE FROM
    steam_revenue
WHERE
    publisherClass = 'Hobbyist'
;

###### Counting the number of particular publishers (not classes)

In [10]:
%%sql
SELECT
    COUNT(DISTINCT publishers) as num_publishers
FROM
    steam_revenue
;

num_publishers
1131


###### Show the 10 publishers with the most games in the dataset

In [120]:
%%sql
SELECT
    publishers, COUNT(*) as count_
FROM
    steam_revenue
GROUP BY
    publishers
ORDER BY
    count_ DESC
LIMIT 10
;

publishers,count_
Kagura Games,17
Electronic Arts,16
072 Project,14
Ubisoft,13
"Mango Party,Mango Party News",11
indie.io,10
Shiravune,10
"Spike Chunsoft Co., Ltd.",9
Nacon,9
Lust Desires 🖤,9


###### Counting the number of developers

In [11]:
%%sql
SELECT
    COUNT(DISTINCT developers) as num_developers
FROM
    steam_revenue
;

num_developers
1405


###### Show the 10 developers with the most games in the dataset

In [124]:
%%sql
SELECT
    developers, COUNT(*) as count_
FROM
    steam_revenue
GROUP BY
    developers
ORDER BY
    count_ DESC
LIMIT 10
;

developers,count_
Lust Desires 🖤,9
MAGES. Inc.,7
EA Los Angeles,7
Square Enix,6
"CAPCOM Co., Ltd.",5
Octo Games,4
FreeMind S.A.,4
BanzaiProject,4
Ubisoft Mainz,3
Romantic Room,3


# --- Analysis --- #

With the relevant pre-processing done, we can ask some questions about the data and perform some analyses. Consider the following questions from the perspective of someone wanting to invest in a particular publisher or developer:

**1) What is the statistical breakdown of revenue by publisher class and what would be the best one to invest in?**

**2) For developers with multiple releases, which would be good ones to invest in?**

(It would also be interesting to see if there are times of the year it's more profitable to release a game, but we can't statistically confirm that with just a year 2024 dataset unfortunately).

## Q1 : What is the statistical breakdown of revenue by publisher class and what would be the best one to invest in? ##

First we'll find out the average, minimum, and maximum revenues (in millions of dollars) per publisher class.

It should be stressed here and throughout that these are *revenue* statistics, and **not profit**. Revenue will likely be higher for AA and AAA publishers than for Indie, but we're not told the production costs involved (which would almost always be higher for AAA and AA than for an Indie game).

###### Avg, Max, Min revenue (in millions of dollars) by publisher class

In [67]:
%%sql
SELECT
    publisherClass,
    ROUND(AVG(revenue) / 1000000, 2) as avg_revenue_M,
    ROUND(MAX(revenue) / 1000000, 2) as max_rev_M,
    ROUND(MIN(revenue) / 1000000, 3) as min_rev_M
FROM 
    steam_revenue
GROUP BY
    publisherClass
ORDER BY
    avg_revenue_M DESC
;

publisherClass,avg_revenue_M,max_rev_M,min_rev_M
AAA,30.51,837.79,0.026
AA,10.17,392.33,0.021
Indie,0.67,34.53,0.021


Rather unsurprisingly, publishers with greater resources see greater revenue. The average revenue brought in by a AAA game is about 3x that of a AA game and about 45x that of an Indie game (AA in turn averages about 15x Indie). The AAA maximum is much higher than the AA maximum, but the top-10 table shows that it is an outlier: the 2nd-highest earner is also AAA and earned \\$436M, which is much closer to the AA maximum of \\$392M. Notably, the minimum earnings all fall within a similar range (\\$26K for AAA and \\$21K for both AA and Indie), but recall that this dataset comprises only the top 1500 earning Steam games, so lower minimums very likely exist outside it.

[We would have analyzed the median as well, but the SQLite build used here unfortunately does not provide a built-in median aggregate to use with GROUP BY]
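
One possible workaround, sketched below, is to register a median aggregate from Python with `sqlite3.Connection.create_aggregate`. This assumes a plain `sqlite3` connection (it would not automatically be visible to the `%%sql` magic's own connection), and the function name `py_median`, the `Median` class, and the toy rows are all hypothetical.

```python
import sqlite3
import statistics


class Median:
    """Aggregate that collects non-NULL values and returns their median."""

    def __init__(self):
        self.values = []

    def step(self, value):
        # Called once per row in the group; NULLs are skipped.
        if value is not None:
            self.values.append(value)

    def finalize(self):
        # Called once per group; an empty group yields NULL.
        return statistics.median(self.values) if self.values else None


conn = sqlite3.connect(":memory:")
conn.create_aggregate("py_median", 1, Median)

conn.execute("CREATE TABLE steam_revenue (publisherClass TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO steam_revenue VALUES (?, ?)",
    [("AAA", 100.0), ("AAA", 300.0), ("AAA", 900.0),
     ("Indie", 10.0), ("Indie", 30.0)],
)

rows = conn.execute(
    """
    SELECT publisherClass, py_median(revenue) AS median_revenue
    FROM steam_revenue
    GROUP BY publisherClass
    ORDER BY publisherClass
    """
).fetchall()

print(rows)
conn.close()
```

The aggregate keeps every value of a group in memory, which is fine for 1499 rows but would not scale to very large tables.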

We can perhaps get better insight by converting revenue from a continuous variable (in \$'s) to an ordinal one (e.g. <500K, 500KTo1M, ...). We'll do this and then express the data as a contingency table over the publisher class and ordinal revenue class (by percentage). We'll look at two contingency tables: one where the percentage is shown relative to the publisherClass, and another where it's relative to the ordinal revenue classification.
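
The bracketing itself can be sketched in plain Python. The helper `revenue_bracket` below is hypothetical (it is not part of the notebook) and mirrors the bracket boundaries used in the SQL `CASE` expressions, with each upper bound inclusive. The five sample rows reuse class and revenue values that appear in the top-10, bottom-10, and preview tables.

```python
from collections import Counter


def revenue_bracket(revenue):
    """Map a revenue in dollars to its ordinal bracket (upper bounds inclusive)."""
    if revenue <= 500_000:
        return "<$500K"
    if revenue <= 1_000_000:
        return "$500K-$1M"
    if revenue <= 10_000_000:
        return "$1M-$10M"
    return ">$10M"


# (publisherClass, revenue) pairs taken from tables shown earlier.
sample = [
    ("AAA", 837_793_356.0),   # Black Myth: Wukong
    ("AA", 63_098_408.0),     # Manor Lords
    ("Indie", 7_882_151.0),   # EARTH DEFENSE FORCE 6
    ("Indie", 20_723.0),      # Claw Machine Sim
    ("AA", 20_674.0),         # Memories Off #5 Togireta Film
]

# Contingency counts keyed by (publisher class, revenue bracket).
counts = Counter((cls, revenue_bracket(rev)) for cls, rev in sample)

for (cls, bracket), n in sorted(counts.items()):
    print(f"{cls:6s} {bracket:10s} {n}")
```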

###### Calculating publisher classes across <u>ordinal revenue percentage</u> (<\\$500K, \\$500K-\\$1M, \\$1M-\\$10M, and >\\$10M) 

In [197]:
%%sql
SELECT  
    publisherClass,
    ROUND(
        SUM(CASE WHEN revenue <= 500*1000 THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 1
    ) AS "<$500K",
    ROUND(
        SUM(CASE WHEN revenue > 500*1000 AND revenue <= 1000*1000 THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 1
    ) AS "$500K-\$1M",
    ROUND(
        SUM(CASE WHEN revenue > 1000*1000 AND revenue <= 10*1000*1000 THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 1
    ) AS "$1M-\$10M",
    ROUND(
        SUM(CASE WHEN revenue > 10*1000*1000 THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 1
    ) AS ">$10M"
FROM
    steam_revenue
GROUP BY
    publisherClass
ORDER BY
    "<$500K"
;

publisherClass,<$500K,$500K-\$1M,$1M-\$10M,>$10M
AAA,36.5,23.1,25.0,15.4
AA,45.2,13.7,30.8,10.3
Indie,81.5,6.9,10.4,1.2


In statistical verbiage, this table expresses the probability of the revenue bracket ('rB') given the publisherClass ('pC'), or $prob (rB | pC)$. Here are some observations from the table:

1) The order of largest percentages to smallest is the same for all publisher classes: <\\$500K, \\$1M-\\$10M, \\$500K-\\$1M, then >\\$10M.

2) Although point (1) is true, the percentage proportions are not the same between classes. About 81.5\% of Indie games fall into the <\\$500K bracket, and that percentage decreases as publisher class increases (AA can be regarded as "higher" than Indie, and AAA higher than AA). Conversely, the percentages increase with publisher class in the \\$500K-\\$1M and >\\$10M brackets.

Something we might ask is "If I select a publisher class to invest in, what's the chance I'll be in Y revenue bracket <u>or higher</u>?" We could add the percentages manually in this text block, but it would be easier if we re-expressed them as cumulative percentages, accumulated from the highest revenue bracket downward:

###### Calculating publisher classes across ordinal revenue <u>cumulative</u> percentages

In [208]:
%%sql
WITH cumulative_revenue AS (
    SELECT  
        publisherClass,
        ROUND(
            SUM(CASE WHEN revenue <= 500*1000 THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 1
        ) AS "<$500K",
        ROUND(
            SUM(CASE WHEN revenue > 500*1000 AND revenue <= 1000*1000 THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 1
        ) AS "$500K-1M",
        ROUND(
            SUM(CASE WHEN revenue > 1000*1000 AND revenue <= 10*1000*1000 THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 1
        ) AS "$1M-10M",
        ROUND(
            SUM(CASE WHEN revenue > 10*1000*1000 THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 1
        ) AS ">$10M"
    FROM
        steam_revenue
    GROUP BY
        publisherClass
)
SELECT
    publisherClass,
    ROUND(100, 2) AS "<$500K",
    ROUND(100 - ("<$500K"), 2) AS "$500K-1M",
    ROUND(100 - ("$500K-1M" + "<$500K"), 2) AS "$1M-10M",
    ROUND(100 - ("$1M-10M" + "$500K-1M" + "<$500K"), 2) AS ">$10M"
FROM
    cumulative_revenue
ORDER BY
    ">$10M" DESC;

publisherClass,<$500K,$500K-1M,$1M-10M,>$10M
AAA,100.0,63.5,40.4,15.4
AA,100.0,54.8,41.1,10.3
Indie,100.0,18.5,11.6,1.2


What stands out as pretty surprising is that the chance of a AA game earning >\\$1M in revenue is just barely larger than that of a AAA game (41.1\% vs 40.4\%)! AA usually has lower production costs than AAA, so if a safe investment is preferred over maximizing the chance of a very high earner, it might be better to go with AA publishers.
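
The same "this bracket or higher" percentages can also be computed directly from counts rather than by subtracting already-rounded percentages, which avoids accumulating rounding error. The sketch below shows one way to do that, assuming a plain `sqlite3` connection and a made-up eight-row table; the output column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE steam_revenue (publisherClass TEXT, revenue REAL)")
# Hypothetical revenues, four games per class.
conn.executemany(
    "INSERT INTO steam_revenue VALUES (?, ?)",
    [
        ("AA", 200_000), ("AA", 700_000), ("AA", 5_000_000), ("AA", 20_000_000),
        ("Indie", 100_000), ("Indie", 300_000), ("Indie", 400_000), ("Indie", 2_000_000),
    ],
)

# Each column counts games strictly above a threshold, so it already is
# the cumulative share for that bracket and every bracket above it.
rows = conn.execute(
    """
    SELECT
        publisherClass,
        ROUND(SUM(CASE WHEN revenue > 500*1000 THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 1) AS pct_over_500K,
        ROUND(SUM(CASE WHEN revenue > 1000*1000 THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 1) AS pct_over_1M,
        ROUND(SUM(CASE WHEN revenue > 10*1000*1000 THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 1) AS pct_over_10M
    FROM steam_revenue
    GROUP BY publisherClass
    ORDER BY publisherClass
    """
).fetchall()

print(rows)
conn.close()
```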

###### Calculating <u>publisher class percentage</u> across ordinal revenue (<\\$500K, \\$500K-\\$1M, \\$1M-\\$10M, and >\\$10M) 

In [209]:
%%sql
WITH total_revenue_brackets AS (
    SELECT  
        SUM(CASE WHEN revenue <= 500*1000 THEN 1 ELSE 0 END) AS total_500K,
        SUM(CASE WHEN revenue > 500*1000 AND revenue <= 1000*1000 THEN 1 ELSE 0 END) AS total_500K_1M,
        SUM(CASE WHEN revenue > 1000*1000 AND revenue <= 10*1000*1000 THEN 1 ELSE 0 END) AS total_1M_10M,
        SUM(CASE WHEN revenue > 10*1000*1000 THEN 1 ELSE 0 END) AS total_10M
    FROM
        steam_revenue
)
SELECT  
    publisherClass,
    ROUND(
        SUM(CASE WHEN revenue <= 500*1000 THEN 1 ELSE 0 END) * 100.0 / (SELECT total_500K FROM total_revenue_brackets), 1
    ) AS "<$500K",
    ROUND(
        SUM(CASE WHEN revenue > 500*1000 AND revenue <= 1000*1000 THEN 1 ELSE 0 END) * 100.0 / (SELECT total_500K_1M FROM total_revenue_brackets), 1
    ) AS "$500K-\$1M",
    ROUND(
        SUM(CASE WHEN revenue > 1000*1000 AND revenue <= 10*1000*1000 THEN 1 ELSE 0 END) * 100.0 / (SELECT total_1M_10M FROM total_revenue_brackets), 1
    ) AS "$1M-\$10M",
    ROUND(
        SUM(CASE WHEN revenue > 10*1000*1000 THEN 1 ELSE 0 END) * 100.0 / (SELECT total_10M FROM total_revenue_brackets), 1
    ) AS ">$10M"
FROM
    steam_revenue
GROUP BY
    publisherClass
ORDER BY
    "<$500K"
;

publisherClass,<$500K,$500K-\$1M,$1M-\$10M,>$10M
AAA,1.7,9.8,6.7,20.5
AA,5.8,16.4,23.3,38.5
Indie,92.6,73.8,69.9,41.0


Now we're seeing the probability (expressed as percentage) of publisherClass given revenue bracket, or $prob (pC | rB)$. Here are some observations:

1) In every bracket, the share of games falls as publisher class increases (increasing again meaning Indie -> AA -> AAA).

2) The percentage proportions are not the same across brackets. The <\\$500K bracket is dominated by Indie publishers, with only 6\% and 2\% of it belonging to AA and AAA games. The distribution becomes more uniform as the revenue bracket increases. The >\\$10M bracket even shows comparable shares for Indie (41.0\%) and AA (38.5\%) publishers.
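
The two contingency tables are linked by Bayes' rule: $prob (pC | rB)$ is proportional to $prob (rB | pC)$ times the number of games in each publisher class. As a rough consistency check, the sketch below recomputes the >\\$10M column of this table from the earlier one, using the class counts (52, 146, 1301) and the rounded >\\$10M percentages (15.4, 10.3, 1.2) copied from the outputs above. Because those inputs are rounded, the result should only be expected to agree with the table to within about a percentage point.

```python
# Games per publisher class, from the publisher class count table.
class_counts = {"AAA": 52, "AA": 146, "Indie": 1301}

# prob(rB = ">$10M" | pC) in percent, from the first contingency table.
pct_over_10m_given_class = {"AAA": 15.4, "AA": 10.3, "Indie": 1.2}

# Approximate number of >$10M games per class: N(pC) * prob(rB | pC).
joint = {
    cls: class_counts[cls] * pct_over_10m_given_class[cls] / 100
    for cls in class_counts
}
total_over_10m = sum(joint.values())

# Normalizing over classes gives prob(pC | rB = ">$10M") in percent.
pct_class_given_over_10m = {
    cls: round(100 * joint[cls] / total_over_10m, 1) for cls in joint
}

# Values reported in the second contingency table, for comparison.
reported = {"AAA": 20.5, "AA": 38.5, "Indie": 41.0}
for cls in class_counts:
    print(f"{cls:6s} recomputed={pct_class_given_over_10m[cls]} reported={reported[cls]}")
```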

# Answer to Q1 : What is the statistical breakdown of revenue by publisher class and what would be the best one to invest in?

1) **The average revenue brought in by Indie publishers is \\$0.67M, with AA about 15x and AAA about 45x higher.** The minimum revenue for AA and Indie publishers was about \\$21K with \\$26K being the minimum for AAA (but this is limited by the data we're given, which is just the top 1500 earning Steam sales between January and August 2024).

2) **For the probability of the revenue bracket given the publisher class ( $prob (rB | pC)$ ), we found a slightly higher chance of a AA game earning >\$1M in revenue than a AAA game.** While we're not given the profits or the production costs associated with the developers or publishers, AA publishers generally have lower costs than AAA, so investing with a AA publisher carries a <u>reduced risk</u> for a comparable chance of high revenue.

3) **The probability of the publisher class given the revenue bracket ( $prob (pC | rB)$ ) showed that most of the highest revenue bracket was split roughly evenly between Indie (41.0\%) and AA (38.5\%) publishers.** Just as AA costs are usually lower than AAA costs, Indie publishers usually have lower costs than AA (sometimes much lower). They can be treated as 'long-shot' investments: great if they win big, but no large loss if they break even or fail.

**Overall recommendation:** Primarily invest in AA publishers due to their demonstrated capacity to earn into the higher revenue brackets while having lower operational costs than AAA. If additional funds are available, then diversify and invest in a variety of Indie publishers due to their likely lower operational costs.

## Q2 : For developers with multiple releases, which would be good ones to invest in? ##

Let's filter out any developers that only released one game in this dataset:

###### Counting the number of <u>developers</u>  who have released 2 or more games in this dataset and the total games they've made:

In [126]:
%%sql
WITH game_count_per_developer AS (
    SELECT
        developers,
        COUNT(*) AS game_count
    FROM 
        steam_revenue
    GROUP BY
        developers
)
SELECT 
    COUNT(*) AS num_developers,
    SUM(game_count) AS total_games
FROM
    game_count_per_developer
WHERE
    game_count >= 2
;

num_developers,total_games
58,151


###### average, min, and max revenue (in thousands of dollars) per <u>developer</u> and sorted by <u>highest minimum</u> revenue:

In [157]:
%%sql
WITH game_revenue AS (
    SELECT 
        developers, 
        revenue
    FROM
        steam_revenue
),
developer_stats AS (
    SELECT
        developers,
        COUNT(*) AS game_count,
        AVG(revenue) AS avg_revenue_,
        MAX(revenue) AS max_revenue_,
        MIN(revenue) AS min_revenue_
    FROM
        game_revenue
    GROUP BY
        developers
)
SELECT
    ROW_NUMBER() OVER (ORDER BY min_revenue_ DESC) AS min_revenue_rank,
    developers, 
    ROUND(avg_revenue_/1000,2) as avg_revenue_K, 
    ROUND(max_revenue_/1000,2) as max_revenue_K,
    ROUND(min_revenue_/1000,2) as min_revenue_K,
    game_count
FROM
    developer_stats
WHERE
    game_count >= 2
ORDER BY
    min_revenue_K DESC
LIMIT 10
;

min_revenue_rank,developers,avg_revenue_K,max_revenue_K,min_revenue_K,game_count
1,Visual Concepts,9439.01,10822.92,8055.1,2
2,ATLUS,21115.56,34601.87,7629.25,2
3,EA Los Angeles,931.03,961.99,920.02,7
4,"CAPCOM Co., Ltd.",23576.21,111478.29,683.95,5
5,Square Enix,2110.34,6828.45,464.42,6
6,WAKUWAKU,546.49,638.5,454.49,2
7,Aspyr,3244.51,6048.34,440.68,2
8,Taboo Tales 💘,319.11,388.95,249.27,2
9,Slitherine Ltd.,1137.39,2060.09,214.69,2
10,Cyanide Studio,459.23,709.83,208.63,2


The table above shows the average, maximum, and minimum revenue (in thousands of dollars) per multi-release developer, the number of games they developed ('game_count') and their rank according to highest minimum revenue ('min_revenue_rank'). We're interested in finding good developers to invest in, so we're sorting these according to highest minimum. That way, we're less likely to simply pick a developer that had one lucky or break-out title but then fails to earn in their other games. There are several observations to make:

1) The 1st row is 'Visual Concepts', who have only developed 2 games, but both earned well judging from the minimum and maximum revenue, with the lesser one pulling in \\$8M. A developer releasing 2 games in this cycle with the least-earning being that high is a positive sign.

2) The 2nd row is 'ATLUS' and they have also released 2 games. Their average revenue was \\$21M and their lowest-earning game pulled in \\$7.6M, but the higher-earning one earned \\$34.6M, much more than the best 'Visual Concepts' title.

3) EA Los Angeles (3rd row) is interesting because they have developed many games (7 in this dataset) which have all pulled in comparable revenue between \\$920K and \\$962K. Dedicating funds here could be viewed as a low risk - low reward investment. It's likely to earn well based on past performance, but seems very unlikely to produce anything very high earning.

4) CAPCOM Co., Ltd. (row 4) developed 5 games and has a pretty wide revenue distribution (low min, very high max), so we should analyze their releases in more detail.

###### Show the 5 games developed by CAPCOM Co., Ltd., their revenue (in thousands of dollars), and the percentage of revenue that each game earned across the 5.

In [152]:
%%sql
WITH capcom_revenue AS (
    SELECT
        name,
        revenue / 1000 as revenue_K
    FROM
        steam_revenue
    WHERE
        developers LIKE 'CAPCOM%'
),
total_capcom_revenue AS (
    SELECT
        SUM(revenue_K) AS total_rev_K
    FROM
        capcom_revenue
)
SELECT
    cr.name,
    cr.revenue_K,
    ROUND(100 * cr.revenue_K / tr.total_rev_K, 1) AS revenue_percentage
FROM
    capcom_revenue cr,
    total_capcom_revenue tr
;

name,revenue_K,revenue_percentage
Apollo Justice: Ace Attorney Trilogy,3465.018,2.9
Kunitsu-Gami: Path of the Goddess,1276.574,1.1
Ace Attorney Investigations Collection,977.191,0.8
Monster Hunter Stories,683.951,0.6
Dragon's Dogma 2,111478.291,94.6


And now it's clearer - the distribution is so wide because among these 5 titles there are 2 titles that earned < \\$1M (a cumulative 1.4\% of earnings), another 2 titles that earned \\$1.3M and \\$3.5M (cumulatively 4\%), and a last title that earned \\$111M - almost 95\%! So they made a couple of okay-earning releases, another couple that earned a few million, and an enormous amount from a single game.

# Answer to Q2 : For developers with multiple releases, which would be good ones to invest in?

1) **A stable, low risk - low reward investment: EA Los Angeles.** Consistently close to \\$1M in revenue per game across 7 games.

2) **Moderate risk - moderate reward: Visual Concepts.** Developed 2 games with \\$8M and \\$10.8M in revenue.

3) **Moderate risk - high reward: ATLUS.** Developed 2 games with \\$7.6M and \\$34.6M in revenue. Seems more likely to create a much higher earning game than Visual Concepts and yet experience similar minimum earnings - but this is using a very small sample size (2).

4) **High risk - <u>very</u> high reward: CAPCOM Co., Ltd.** Developed 5 games with 2 earning <\\$1M, another 2 earning \\$1.3M and \\$3.5M, and a blow-out title bringing in >\\$100M - which is the 5th highest earning title in the dataset.

# --- Final Notes --- #

# Saving the modified table to .csv

The python code block below was used to convert the modified sqlite table to a pandas dataframe, which was then saved to the folder as 'steam_revenue_cleaned.csv'. Note that this table reduced the precision on some of the floats, set some values to Null, and removed the single 'Hobbyist' publisher class entry (reducing the dataset size from 1500 to 1499).

The code to do this is commented out in case someone wanted to run the notebook from scratch. Remove the block comments if you wish to save the dataframe (but the original result from the first run should already be in the folder).
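
As an alternative to the commented-out cell that follows, the export can also be sketched with only the standard library, without relying on the `%sql` magic. This is an illustration rather than the method used for the saved file: it assumes a plain `sqlite3` connection, uses a made-up two-row table, and writes to an in-memory buffer instead of 'steam_revenue_cleaned.csv'.

```python
import csv
import io
import sqlite3

# In-memory stand-in; the notebook's database would be 'steam_revenue.db'.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE steam_revenue (name TEXT, price REAL, reviewScore INTEGER)")
conn.executemany(
    "INSERT INTO steam_revenue VALUES (?, ?, ?)",
    [("Game A", 60.0, 57), ("Game B", 70.0, None)],  # hypothetical rows
)

cursor = conn.execute("SELECT * FROM steam_revenue")
# cursor.description holds one tuple per column; the first item is the name.
header = [col[0] for col in cursor.description]

# To write a real file, swap the buffer for:
# open('steam_revenue_cleaned.csv', 'w', newline='')
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(cursor)  # NULL values are written as empty fields

csv_text = buffer.getvalue()
print(csv_text)
conn.close()
```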

In [136]:
"""
# Run the command to fully select table from database
table_to_df = %sql SELECT * FROM steam_revenue
# Convert the SQL result into a Pandas DataFrame
modified_df = table_to_df.DataFrame()
# Then finally export it as csv
modified_df.to_csv('steam_revenue_cleaned.csv', index=False)
"""

### Deeper statistical questions / R

There are actually deeper statistical questions to ask concerning this dataset, such as 

1) What's the relationship between reviewScore and revenue? Does a game with high reviewScore necessarily earn well?

2) Do the free-to-play games (those with a price of \\$0) generally experience higher revenue than those that are not?

3) Between the publisher classes, which would be the **statistically optimal** one to invest in? We have a rough idea from our contingency table earlier, but we can definitely get more exact.

But it's better to address these using more robust statistically-oriented software. We'll create an R markdown document and investigate further using R.

### Tableau Dashboard

A dashboard going through a similar analysis can be found on the public Tableau server at https://tinyurl.com/3f3yyydu.