# Video Games Review: Analysis

In this project, I explore [Metacritic's video game reviews](https://www.metacritic.com/browse/games/score/metascore/all/all/filtered?page=0) for games released between 1995 and 2021.

Please see the [README](https://github.com/henrylin03/video-games) for more information.


## Setup

In this section, I import all necessary libraries, set up the SQLite database, and then load the input `.csv` files in a form that both `sqlite` and `pandas` can manipulate:


In [1]:
# import necessary libraries
import os
import pandas as pd
from sqlalchemy import create_engine

# creating SQLite database
engine = create_engine("sqlite://", echo=False)

# import CSV as DataFrames and create SQL tables
INPUT_PATH = "./input"
meta_df = pd.read_csv(os.path.join(INPUT_PATH, "meta.csv"))
meta_df.to_sql("meta", engine, if_exists="replace", index=False)

user_df = pd.read_csv(os.path.join(INPUT_PATH, "user.csv"))
user_df.to_sql("user", engine, if_exists="replace", index=False)


18558
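
The number printed above is the row count returned by the last `to_sql()` call. As a minimal sketch of the same round trip, the block below writes a toy DataFrame to an in-memory SQLite database and reads it back with SQL. It assumes the standard-library `sqlite3` connection as a stand-in for the SQLAlchemy engine used in this notebook, and the table name `toy` and its values are hypothetical.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database (stand-in for the SQLAlchemy engine above).
conn = sqlite3.connect(":memory:")

# Hypothetical toy data, not taken from the real input files.
toy = pd.DataFrame({"name": ["A", "B"], "meta_score": [90, 80]})
toy.to_sql("toy", conn, if_exists="replace", index=False)

# The same rows are now reachable through SQL as well as pandas.
round_trip = pd.read_sql_query(
    "SELECT name, meta_score FROM toy ORDER BY meta_score DESC", conn
)
conn.close()
print(round_trip)
```

Keeping the data in both forms lets each cleaning step use whichever tool is more convenient.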

## Cleaning

Before analysis, the following steps should be taken:

1. Merge the two tables.
2. Clean the `platform` column by removing the `"Platform: "` prefix.
3. Ensure correct data types: namely, `release_date` should be in datetime format.

_Missing values?
Duplicate values?
Extract awards from the summary?_


### Merging Tables

As there are some games that have a Metascore but no User Score due to insufficient reviews (and vice versa), we need to join the two tables together.

Here, we use the pandas `DataFrame.merge()` method, as SQLite versions before 3.39.0 do not support `FULL OUTER JOIN` _([SQLite Tutorial](https://www.sqlitetutorial.net/sqlite-full-outer-join/))_.


In [2]:
merged_df = meta_df.merge(user_df, how="outer", on=["name", "platform", "release_date"])

merged_df.head(3)


Unnamed: 0,meta_score,meta_rank,name,platform,release_date,summary_x,user_score,user_rank,summary_y
0,99,1.0,The Legend of Zelda: Ocarina of Time,Platform: Nintendo 64,"November 23, 1998","As a young boy, Link is tricked by Ganondorf, ...",9.1,84.0,"As a young boy, Link is tricked by Ganondorf, ..."
1,98,2.0,Tony Hawk's Pro Skater 2,Platform: PlayStation,"September 20, 2000",As most major publishers' development efforts ...,7.4,8651.0,As most major publishers' development efforts ...
2,98,3.0,Grand Theft Auto IV,Platform: PlayStation 3,"April 29, 2008",[Metacritic's 2008 PS3 Game of the Year; Also ...,7.8,5567.0,[Metacritic's 2008 PS3 Game of the Year; Also ...
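
To see how many rows an outer merge takes from each side, `merge()` accepts `indicator=True`. The sketch below uses two small hypothetical tables (`meta_toy` and `user_toy`, not the real data) to show the idea; it is an optional diagnostic rather than a step this notebook performs.

```python
import pandas as pd

# Toy stand-ins for the two review tables (hypothetical values).
meta_toy = pd.DataFrame(
    {"name": ["A", "B"], "platform": ["PC", "PC"], "meta_score": [90, 80]}
)
user_toy = pd.DataFrame(
    {"name": ["B", "C"], "platform": ["PC", "PC"], "user_score": [7.5, 6.0]}
)

# indicator=True adds a `_merge` column: "left_only", "right_only" or "both".
merged_toy = meta_toy.merge(
    user_toy, how="outer", on=["name", "platform"], indicator=True
)
merge_counts = merged_toy["_merge"].value_counts()
print(merge_counts)
```

Rows marked `left_only` have a Metascore but no User Score, and `right_only` rows are the reverse.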


In [3]:
# taking only the `summary` column that has values - if neither have values, then leave empty
if not merged_df[merged_df.summary_y.isna() & ~merged_df.summary_x.isna()].empty:
    print('There are rows with a "summary" in column "summary_x" and not "summary_y"')
if not merged_df[merged_df.summary_x.isna() & ~merged_df.summary_y.isna()].empty:
    print('There are rows with a "summary" in column "summary_y" and not "summary_x"')

# only the first message prints (see output), so no row has a summary solely in `summary_y`;
# thus, we can safely drop the summary_y column
merged_df = merged_df.drop(columns=["summary_y"]).rename(
    columns={"summary_x": "summary"}
)
merged_df.head(3)


There are rows with a "summary" in column "summary_x" and not "summary_y"


Unnamed: 0,meta_score,meta_rank,name,platform,release_date,summary,user_score,user_rank
0,99,1.0,The Legend of Zelda: Ocarina of Time,Platform: Nintendo 64,"November 23, 1998","As a young boy, Link is tricked by Ganondorf, ...",9.1,84.0
1,98,2.0,Tony Hawk's Pro Skater 2,Platform: PlayStation,"September 20, 2000",As most major publishers' development efforts ...,7.4,8651.0
2,98,3.0,Grand Theft Auto IV,Platform: PlayStation 3,"April 29, 2008",[Metacritic's 2008 PS3 Game of the Year; Also ...,7.8,5567.0
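
If some rows had carried a summary only in `summary_y`, dropping that column would lose data. A hedged alternative, sketched on hypothetical toy values, is to coalesce the two columns with `Series.combine_first()` before dropping them:

```python
import pandas as pd

# Hypothetical rows: summary in both, only in summary_y, and in neither.
toy = pd.DataFrame(
    {
        "summary_x": ["from meta", None, None],
        "summary_y": ["from meta", "from user", None],
    }
)

# Prefer summary_x; fall back to summary_y where summary_x is missing.
toy["summary"] = toy["summary_x"].combine_first(toy["summary_y"])
toy = toy.drop(columns=["summary_x", "summary_y"])
print(toy)
```

For this dataset the check above shows the fallback is not needed, so the simpler drop-and-rename is sufficient.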


### Removing string prefix in `platform` column
As all values in the `platform` column have a `"Platform: "` prefix, I use SQL's `SUBSTR()` to begin the string at a later index.

If the unwanted substring were not consistently a prefix, I would use `REPLACE()` instead. 
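
The difference between the two functions can be sketched directly in SQLite. The block below assumes the standard-library `sqlite3` module and uses literal strings rather than the `games` table. `"Platform: "` is 10 characters long and `SUBSTR()` is 1-indexed, hence the start position of 11.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
row = conn.execute(
    """
    SELECT SUBSTR('Platform: PC', 11),                 -- prefix present
           REPLACE('Platform: PC', 'Platform: ', ''),  -- prefix present
           SUBSTR('PC', 11),                           -- prefix absent
           REPLACE('PC', 'Platform: ', '')             -- prefix absent
    """
).fetchone()
conn.close()
print(row)  # ('PC', 'PC', '', 'PC')
```

`SUBSTR()` blanks out a value that lacks the prefix, whereas `REPLACE()` leaves it unchanged, which is why checking the distinct values first matters.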

In [4]:
# displaying distinct values of `platform` column to confirm all values start with prefix
merged_df.to_sql("games", engine, if_exists="replace", index=False)
pd.read_sql_query("SELECT DISTINCT platform FROM games", engine)

Unnamed: 0,platform
0,Platform: Nintendo 64
1,Platform: PlayStation
2,Platform: PlayStation 3
3,Platform: Dreamcast
4,Platform: Xbox 360
5,Platform: Wii
6,Platform: Xbox One
7,Platform: PC
8,Platform: Switch
9,Platform: PlayStation 2


In [5]:
merged_df = pd.read_sql_query(
    """
    SELECT meta_rank,
        meta_score,
        user_rank,
        user_score,
        name,
        SUBSTR(platform, 11) platform,
        release_date,
        summary
    FROM 
        games
        """,
    engine,
)
merged_df

Unnamed: 0,meta_rank,meta_score,user_rank,user_score,name,platform,release_date,summary
0,1.0,99,84.0,9.1,The Legend of Zelda: Ocarina of Time,Nintendo 64,"November 23, 1998","As a young boy, Link is tricked by Ganondorf, ..."
1,2.0,98,8651.0,7.4,Tony Hawk's Pro Skater 2,PlayStation,"September 20, 2000",As most major publishers' development efforts ...
2,3.0,98,5567.0,7.8,Grand Theft Auto IV,PlayStation 3,"April 29, 2008",[Metacritic's 2008 PS3 Game of the Year; Also ...
3,4.0,98,2034.0,8.4,SoulCalibur,Dreamcast,"September 8, 1999","This is a tale of souls and swords, transcendi..."
4,5.0,98,4189.0,8.0,Grand Theft Auto IV,Xbox 360,"April 29, 2008",[Metacritic's 2008 Xbox 360 Game of the Year; ...
...,...,...,...,...,...,...,...,...
19970,19970.0,17,18159.0,3.3,Vroom in the Night Sky,Switch,"April 5, 2017",Vroom in the night sky is a magical bike actio...
19971,19971.0,17,18472.0,1.9,Leisure Suit Larry: Box Office Bust,PlayStation 3,"May 5, 2009",The Leisure Suit Larry: Box Office Bust video ...
19972,19972.0,17,17546.0,4.4,Yaris,Xbox 360,"October 10, 2007",[Xbox Live Arcade] Hop into a Toyota Yaris and...
19973,19973.0,16,18532.0,1.4,Ride to Hell: Retribution,PC,"June 24, 2013",The game is set in the last years of the roari...


### Ensuring Datetime data type
SQLite's date functions do not parse month names ([StackOverflow](https://stackoverflow.com/questions/1181123/date-formatting-from-sqlite-query)) such as those found in the `release_date` column, so we use the [`pandas.to_datetime()`](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html) function instead:

In [6]:
merged_df["release_date"] = pd.to_datetime(merged_df.release_date, format="%B %d, %Y")
merged_df.to_sql("games", engine, if_exists="replace", index=False)
merged_df.head(2)


Unnamed: 0,meta_rank,meta_score,user_rank,user_score,name,platform,release_date,summary
0,1.0,99,84.0,9.1,The Legend of Zelda: Ocarina of Time,Nintendo 64,1998-11-23,"As a young boy, Link is tricked by Ganondorf, ..."
1,2.0,98,8651.0,7.4,Tony Hawk's Pro Skater 2,PlayStation,2000-09-20,As most major publishers' development efforts ...
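
With an explicit `format`, `to_datetime()` raises an error on any value that does not match. A minimal sketch of a more forgiving variant is shown below: `errors="coerce"` turns unparseable values into `NaT` so they can be inspected afterwards. The `"TBA"` entry is a hypothetical bad value, not one observed in this dataset.

```python
import pandas as pd

# Two dates in the dataset's style plus one hypothetical unparseable value.
dates = pd.Series(["November 23, 1998", "September 8, 1999", "TBA"])

# errors="coerce" converts values that do not match the format to NaT.
parsed = pd.to_datetime(dates, format="%B %d, %Y", errors="coerce")
unparsed = dates[parsed.isna()]
print(parsed)
print(unparsed)
```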


### Missing Values
Although the dataset has missing values, no further action is taken to impute them, because:
1. The `user_rank` and `user_score` columns are missing when a game has fewer than 7 user reviews, which makes them difficult to impute.
2. The `summary` of each game is authored by Metacritic, which also makes it difficult to impute.

In [7]:
pd.read_sql_query(
    """
    SELECT SUM(CASE WHEN meta_rank IS NULL THEN 1 ELSE 0 END) meta_rank_miss,
        SUM(CASE WHEN meta_score IS NULL THEN 1 ELSE 0 END) meta_score_miss,
        SUM(CASE WHEN user_rank IS NULL THEN 1 ELSE 0 END) user_rank_miss,
        SUM(CASE WHEN user_score IS NULL THEN 1 ELSE 0 END) user_score_miss,
        SUM(CASE WHEN name IS NULL THEN 1 ELSE 0 END) name_miss,
        SUM(CASE WHEN platform IS NULL THEN 1 ELSE 0 END) platform_miss,
        SUM(CASE WHEN release_date IS NULL THEN 1 ELSE 0 END) release_date_miss,
        SUM(CASE WHEN summary IS NULL THEN 1 ELSE 0 END) summary_miss
    FROM games
    """,
    engine,
)


Unnamed: 0,meta_rank_miss,meta_score_miss,user_rank_miss,user_score_miss,name_miss,platform_miss,release_date_miss,summary_miss
0,0,0,1417,1417,0,0,0,94
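
The same per-column count is available in pandas through `DataFrame.isna().sum()`. The sketch below runs it on a small hypothetical frame rather than on `merged_df`:

```python
import pandas as pd

# Hypothetical rows with one missing user_score and one missing summary.
toy = pd.DataFrame(
    {
        "name": ["x", "y", "z"],
        "user_score": [9.1, None, 7.8],
        "summary": ["a", "b", None],
    }
)

# isna() marks missing cells; sum() counts them per column.
missing_counts = toy.isna().sum()
print(missing_counts)
```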


### Duplicates

In [11]:
# check duplicates for "name" and "summary" columns, as the other attributes can be expected to have duplicates
for col in ["name", "summary"]:
    dup_df = pd.read_sql_query(
        f"""
        SELECT {col},
            COUNT(*) counts
        FROM games
        WHERE {col} IS NOT NULL
        GROUP BY 1
        HAVING counts > 1
        ORDER BY 2 DESC
        """,
        engine,
    )
    if not dup_df.empty:
        display(col, dup_df)


'name'

Unnamed: 0,name,counts
0,Madden NFL 07,9
1,Madden NFL 06,8
2,X-Men: The Official Game,7
3,Tiger Woods PGA Tour 07,7
4,The Sims 2,7
...,...,...
4131,1942: Joint Strike,2
4132,187 Ride or Die,2
4133,13 Sentinels: Aegis Rim,2
4134,10 Second Ninja X,2


'summary'

Unnamed: 0,summary,counts
0,The game involves players creating and destroy...,7
1,Need for Speed ProStreet accelerates street ra...,7
2,LEGO Indiana Jones: The Original Adventures ta...,7
3,"Iron Man, an explosive third-person action sho...",7
4,"In Harry Potter and the Order of the Phoenix, ...",7
...,...,...
2987,'DARK SOULS II Crown of the Ivory King' is the...,2
2988,"""Street Fighter"" is best known for its well-po...",2
2989,"""Our goal was to create a game that is perfect...",2
2990,"""Only the possibility of you can change our fa...",2


#### Duplicate names
The `name` column has duplicates largely because each row represents a game on a particular _platform_. For example, if a game was released on PC, PS4 and PS5, each release has its own row.
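
One way to test this explanation in pandas is to compare, for each name, the number of rows with the number of distinct platforms. The sketch below uses hypothetical game names; if the two counts match, every duplicate of that name is accounted for by a different platform.

```python
import pandas as pd

# Hypothetical games: "Game A" appears once per platform.
toy = pd.DataFrame(
    {
        "name": ["Game A", "Game A", "Game A", "Game B"],
        "platform": ["PC", "PlayStation 4", "PlayStation 5", "PC"],
    }
)

# Rows per name versus distinct platforms per name.
summary = toy.groupby("name").agg(
    rows=("platform", "size"), platforms=("platform", "nunique")
)

# Names with more rows than platforms are duplicated within a platform.
same_platform_dups = summary[summary["rows"] > summary["platforms"]]
print(summary)
```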



In [19]:
# join other attributes to review potential causes of duplication in `name` column
pd.read_sql_query(
    """
    WITH dups_name AS (
        SELECT name, COUNT(*) counts
        FROM games
        GROUP BY 1 
        HAVING counts > 1
    )
    
    SELECT d.name name_dups,
        g.meta_rank,
        g.meta_score,
        g.user_rank,
        g.user_score,
        g.platform,
        g.release_date,
        g.summary
    FROM dups_name d
    JOIN games g
        USING (name)
    ORDER BY 1
    """,
    engine,
)


Unnamed: 0,name_dups,meta_rank,meta_score,user_rank,user_score,platform,release_date,summary
0,.hack//G.U. Last Recode,7351.0,76,12170.0,6.7,Switch,2022-03-10 00:00:00.000000,.hack//G.U. is back! Log back into the .hack//...
1,.hack//G.U. Last Recode,7816.0,76,3285.0,8.1,PlayStation 4,2017-11-03 00:00:00.000000,.hack//G.U. is back. Log back into the .hack//...
2,.hack//G.U. Last Recode,12454.0,69,4596.0,7.9,PC,2017-11-03 00:00:00.000000,.hack//G.U. is back! Log back into the .hack//...
3,007 Legends,19241.0,45,17422.0,4.5,Xbox 360,2012-10-16 00:00:00.000000,"007 Legends features an original, overarching ..."
4,007 Legends,19544.0,41,17750.0,4.1,PlayStation 3,2012-10-16 00:00:00.000000,Each of the 007 Legends movie-inspired mission...
...,...,...,...,...,...,...,...,...
11202,kill.switch,9365.0,73,3080.0,8.2,PlayStation 2,2003-10-28 00:00:00.000000,"In a world on the brink of global conflict, yo..."
11203,kill.switch,14115.0,66,11593.0,6.9,PC,2004-03-30 00:00:00.000000,"In a world on the brink of global conflict, yo..."
11204,nail'd,12297.0,69,6961.0,7.6,PC,2010-11-30 00:00:00.000000,nail’d is all about eschewing boring realism f...
11205,nail'd,13992.0,66,14321.0,6.2,PlayStation 3,2010-11-30 00:00:00.000000,nail’d is all about eschewing boring realism f...


In [21]:
# confirm hypothesis that duplication in the `name` column is caused by different platforms
pd.read_sql_query(
    """
        SELECT name,
            platform,
            COUNT(*) as counts
        FROM games
        GROUP BY 1, 2
        HAVING counts > 1
                """,
    engine,
)


Unnamed: 0,name,platform,counts
0,Ninja Gaiden: Master Collection,Xbox One,2


In [22]:
merged_df[merged_df.name=="Ninja Gaiden: Master Collection"]

Unnamed: 0,meta_rank,meta_score,user_rank,user_score,name,platform,release_date,summary
9873,9874.0,73,10621.0,7.0,Ninja Gaiden: Master Collection,PlayStation 4,2021-06-10,Three extremely high speed and intense ninja a...
10253,10254.0,72,9964.0,7.1,Ninja Gaiden: Master Collection,Xbox One,2021-06-10,Three extremely high speed and intense ninja a...
10254,10254.0,72,10622.0,7.0,Ninja Gaiden: Master Collection,Xbox One,2021-06-10,Three extremely high speed and intense ninja a...
10563,10563.0,72,9375.0,7.2,Ninja Gaiden: Master Collection,Switch,2021-06-10,Three extremely high speed and intense ninja a...
10704,10704.0,71,9965.0,7.1,Ninja Gaiden: Master Collection,PC,2021-06-09,Three extremely high speed and intense ninja a...
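
The two Xbox One rows differ only in their user figures, so they look like a true duplicate. A minimal sketch of flagging and removing such rows with `duplicated()` and `drop_duplicates()` follows. The toy values are hypothetical, and keeping the first occurrence is an assumption, not a decision made in this notebook.

```python
import pandas as pd

# Hypothetical rows mirroring the pattern above: two entries for one platform.
toy = pd.DataFrame(
    {
        "name": ["Game X", "Game X", "Game X"],
        "platform": ["Xbox One", "Xbox One", "PC"],
        "user_score": [7.1, 7.0, 7.1],
    }
)

# keep=False flags every member of a duplicated (name, platform) group.
dup_mask = toy.duplicated(subset=["name", "platform"], keep=False)
dup_rows = toy[dup_mask]

# Assumption: keep the first occurrence of each (name, platform) pair.
deduped = toy.drop_duplicates(subset=["name", "platform"], keep="first")
print(dup_rows)
print(deduped)
```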


#### Duplicate summaries