# Video Games Review: Analysis

In this project, I explore [Metacritic's video game reviews](https://www.metacritic.com/browse/games/score/metascore/all/all/filtered?page=0) for games released between 1995 and 2021.

Please see the [README](https://github.com/henrylin03/video-games) for more information.


## Setup

In this section, I import all necessary libraries, set up the SQLite database, and then import the input `.csv` files in a format that allows manipulation by both `sqlite` and `pandas`:


In [52]:
import os
import pandas as pd
from sqlalchemy import create_engine

# creating SQLite database
engine = create_engine("sqlite://", echo=False)

In [58]:
from glob import glob

INPUT_PATH = "./input"
csv_paths = glob(os.path.join(INPUT_PATH, "*.csv"))

dfs = {}
for c in csv_paths:
    # file name without directory or extension, independent of the length of INPUT_PATH
    c_cleaned = os.path.splitext(os.path.basename(c))[0]
    dfs[c_cleaned] = pd.read_csv(c).sort_values(["name", "platform"])
    display(c_cleaned, dfs[c_cleaned].head(3))


'meta'

Unnamed: 0,meta_score,meta_rank,name,platform,release_date,summary
12510,69,12511.0,#DRIVE,Platform: Switch,"February 16, 2021",#DRIVE is an endless driving videogame inspire...
6686,77,6687.0,#IDARB,Platform: Xbox One,"January 30, 2015",#IDARB started as a picture of red box. Then m...
18217,53,18218.0,#KILLALLZOMBIES,Platform: Xbox One,"August 10, 2016",In the distant future; #KILLALLZOMBIES is thru...


'user'

Unnamed: 0,user_score,user_rank,name,platform,release_date,summary
11697,6.8,11698.0,#DRIVE,Platform: Switch,"February 16, 2021",#DRIVE is an endless driving videogame inspire...
11391,6.9,11392.0,#IDARB,Platform: Xbox One,"January 30, 2015",#IDARB started as a picture of red box. Then m...
5521,7.8,5522.0,'Splosion Man,Platform: Xbox 360,"July 22, 2009","[Xbox Live Arcade] Fools they were, to tread w..."


## Cleaning

### Removing string prefix in `platform` column

As all values in the `platform` column have a `"Platform: "` prefix, I use SQL's `SUBSTR()` to start each string at index 11, just past the 10-character prefix.

If the unwanted substring were not consistently a prefix, I would use `REPLACE()` instead.
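
To make the difference between the two functions concrete, here is a minimal sketch using Python's built-in `sqlite3` module rather than the SQLAlchemy engine used in this notebook. The `demo` table and its two rows are hypothetical and exist only for illustration.

```python
import sqlite3

# Hypothetical in-memory table, used only to compare the two functions
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (platform TEXT)")
conn.executemany(
    "INSERT INTO demo VALUES (?)",
    [("Platform: Switch",), ("Platform: Xbox One",)],
)

# SUBSTR is 1-indexed: "Platform: " is 10 characters, so the value starts at 11
substr_result = [
    row[0]
    for row in conn.execute("SELECT SUBSTR(platform, 11) FROM demo ORDER BY rowid")
]

# REPLACE removes the substring wherever it occurs, not only at the start
replace_result = [
    row[0]
    for row in conn.execute(
        "SELECT REPLACE(platform, 'Platform: ', '') FROM demo ORDER BY rowid"
    )
]
conn.close()

print(substr_result, replace_result)
```

Both queries give the same result here because the substring only ever appears as a prefix; `SUBSTR()` would cut the wrong characters if that assumption did not hold.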


In [24]:
# displaying distinct values of `platform` column to confirm all values start with prefix
merged_df.to_sql("games", engine, if_exists="replace", index=False)
pd.read_sql_query("SELECT DISTINCT platform FROM games", engine)


Unnamed: 0,platform
0,Platform: Switch
1,Platform: Xbox One
2,Platform: Xbox 360
3,Platform: PlayStation 3
4,Platform: PC
5,Platform: PlayStation 4
6,Platform: PlayStation 2
7,Platform: PlayStation
8,Platform: DS
9,Platform: Wii


In [25]:
merged_df = pd.read_sql_query(
    """
    SELECT name,
        SUBSTR(platform, 11) platform,
        release_date,
        summary,
        meta_score,
        user_score
    FROM 
        games
        """,
    engine,
)
merged_df


Unnamed: 0,name,platform,release_date,summary,meta_score,user_score
0,#DRIVE,Switch,"February 16, 2021",#DRIVE is an endless driving videogame inspire...,69,6.7
1,#IDARB,Xbox One,"January 30, 2015",#IDARB started as a picture of red box. Then m...,77,7.6
2,#KILLALLZOMBIES,Xbox One,"August 10, 2016",In the distant future; #KILLALLZOMBIES is thru...,53,3.2
3,'Splosion Man,Xbox 360,"July 22, 2009","[Xbox Live Arcade] Fools they were, to tread w...",84,8.3
4,.detuned,PlayStation 3,"October 15, 2009",[Playstation Network] Developed by .theprodukk...,46,
...,...,...,...,...,...,...
19987,rain,PlayStation 3,"October 1, 2013","Developed by PlayStation C.A.M.P!, the creativ...",72,7.1
19988,theHunter: Call of the Wild,PC,"February 16, 2017",Experience the thrill of the hunt in a vast op...,64,6.0
19989,uDraw Studio,Wii,"November 14, 2010","Packaged with the uDraw GameTablet, uDraw Stud...",71,7.0
19990,void tRrLM(); //Void Terrarium,Switch,"July 14, 2020","In a world contaminated with toxic fungi, a di...",74,7.3


### Ensuring Datetime data type

As SQLite's date functions do not recognise month names ([StackOverflow](https://stackoverflow.com/questions/1181123/date-formatting-from-sqlite-query)) such as those in the `release_date` column, I use the [`pandas.to_datetime()`](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html) function instead:


In [26]:
merged_df["release_date"] = pd.to_datetime(merged_df.release_date, format="%B %d, %Y")
merged_df.to_sql("games", engine, if_exists="replace", index=False)
merged_df.head(2)


Unnamed: 0,name,platform,release_date,summary,meta_score,user_score
0,#DRIVE,Switch,2021-02-16,#DRIVE is an endless driving videogame inspire...,69,6.7
1,#IDARB,Xbox One,2015-01-30,#IDARB started as a picture of red box. Then m...,77,7.6
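
The reason for converting in Python can be sketched with the standard library alone. The example below is an illustration under stated assumptions, not the notebook's own code: it uses `datetime.strptime` with the same `"%B %d, %Y"` format string, and the built-in `sqlite3` module instead of the SQLAlchemy engine, to show that SQLite's `strftime()` returns `NULL` for the raw string but works once the date is in ISO-8601 form.

```python
import sqlite3
from datetime import datetime

raw = "February 16, 2021"

# %B = full month name, %d = day of month, %Y = four-digit year
parsed = datetime.strptime(raw, "%B %d, %Y")
iso = parsed.date().isoformat()

conn = sqlite3.connect(":memory:")
# SQLite date functions return NULL for text they cannot parse
year_from_raw = conn.execute("SELECT strftime('%Y', ?)", (raw,)).fetchone()[0]
year_from_iso = conn.execute("SELECT strftime('%Y', ?)", (iso,)).fetchone()[0]
conn.close()

print(iso, year_from_raw, year_from_iso)
```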


### Missing Values

Though there are missing values in the dataset, no further action is taken to impute them because:

1. The `user_rank` and `user_score` columns are missing when a game has fewer than 7 user reviews, and these values are difficult to impute.
2. The `summary` of each game is authored by Metacritic, so it is also difficult to impute.


In [27]:
pd.read_sql_query(
    """
    SELECT SUM(CASE WHEN meta_score IS NULL THEN 1 ELSE 0 END) meta_score_miss,
        SUM(CASE WHEN user_score IS NULL THEN 1 ELSE 0 END) user_score_miss,
        SUM(CASE WHEN name IS NULL THEN 1 ELSE 0 END) name_miss,
        SUM(CASE WHEN platform IS NULL THEN 1 ELSE 0 END) platform_miss,
        SUM(CASE WHEN release_date IS NULL THEN 1 ELSE 0 END) release_date_miss,
        SUM(CASE WHEN summary IS NULL THEN 1 ELSE 0 END) summary_miss
    FROM games
    """,
    engine,
)


Unnamed: 0,meta_score_miss,user_score_miss,name_miss,platform_miss,release_date_miss,summary_miss
0,0,1405,0,0,0,8
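
As an aside, the same tally can be written more compactly, because `COUNT(column)` skips `NULL` values while `COUNT(*)` counts every row. The sketch below assumes a hypothetical three-row `demo` table and uses the built-in `sqlite3` module rather than the notebook's engine.

```python
import sqlite3

# Hypothetical table with a few NULLs, used only for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (name TEXT, user_score REAL, summary TEXT)")
conn.executemany(
    "INSERT INTO demo VALUES (?, ?, ?)",
    [("A", 7.5, "text"), ("B", None, "text"), ("C", None, None)],
)

# COUNT(*) counts all rows; COUNT(col) counts only non-NULL values of col
missing = conn.execute(
    """
    SELECT COUNT(*) - COUNT(name) AS name_miss,
        COUNT(*) - COUNT(user_score) AS user_score_miss,
        COUNT(*) - COUNT(summary) AS summary_miss
    FROM demo
    """
).fetchone()
conn.close()

print(missing)
```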


### Duplicates


In [28]:
# deduplicate first as no entries should be identical for all columns
merged_df = merged_df.drop_duplicates()
merged_df.to_sql("games", engine, if_exists="replace", index=False)

# check duplicates for "name", "summary" columns, as the other attributes can be expected to have duplicates
for col in ["name", "summary"]:
    dup_df = pd.read_sql_query(
        f"""
        SELECT {col},
            COUNT(*) counts
        FROM games
        WHERE {col} IS NOT NULL
        GROUP BY 1
        HAVING counts > 1
        ORDER BY 2 DESC
        """,
        engine,
    )
    if not dup_df.empty:
        display(col, dup_df)


'name'

Unnamed: 0,name,counts
0,Madden NFL 07,9
1,Madden NFL 06,8
2,X-Men: The Official Game,7
3,Tiger Woods PGA Tour 07,7
4,The Sims 2,7
...,...,...
4135,1942: Joint Strike,2
4136,187 Ride or Die,2
4137,13 Sentinels: Aegis Rim,2
4138,10 Second Ninja X,2


'summary'

Unnamed: 0,summary,counts
0,The game involves players creating and destroy...,7
1,Need for Speed ProStreet accelerates street ra...,7
2,LEGO Indiana Jones: The Original Adventures ta...,7
3,"Iron Man, an explosive third-person action sho...",7
4,"In Harry Potter and the Order of the Phoenix, ...",7
...,...,...
3042,'DARK SOULS II Crown of the Ivory King' is the...,2
3043,"""Street Fighter"" is best known for its well-po...",2
3044,"""Our goal was to create a game that is perfect...",2
3045,"""Only the possibility of you can change our fa...",2


#### Duplicate names

The `name` column has duplicates largely because each row represents a game on a particular _platform_. For example, if a game was released on PC, PS4 and PS5, each release has its own entry.

However, if a game is still duplicated when grouping by both `name` and `platform`, the entry with incorrect attributes should be removed.

In [29]:
pd.read_sql_query(
    """
    SELECT name,
        platform,
        COUNT(*) counts
    FROM games
    GROUP BY 1, 2
    HAVING counts > 1
    ORDER BY 3 DESC
    """,
    engine,
)


Unnamed: 0,name,platform,counts
0,Just Dance 2023 Edition,Switch,2


In [30]:
# find the full entries for the duplicated (name, platform) pairs found above
pd.read_sql_query(
    """
    WITH dups_name_platform AS (
        SELECT name, platform, COUNT(*) counts
        FROM games
        GROUP BY 1, 2
        HAVING counts > 1
    )
    
    SELECT g.*
    FROM games g
    JOIN dups_name_platform
    USING (name, platform)
    """,
    engine,
)


Unnamed: 0,name,platform,release_date,summary,meta_score,user_score
0,Just Dance 2023 Edition,Switch,2022-11-22 00:00:00.000000,Welcome to a new era of dance with Just Dance®...,77,7.6
1,Just Dance 2023 Edition,Switch,2022-11-22 00:00:00.000000,Welcome to a new era of dance with Just Dance®...,76,7.5
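
One possible way to remove such an entry is sketched below. This is an assumption-laden illustration: the notebook does not say which of the two rows is the incorrect one, so the sketch arbitrarily keeps the first row of each `(name, platform)` pair. The `demo` table and its rows are hypothetical, and the built-in `sqlite3` module stands in for the notebook's engine.

```python
import sqlite3

# Hypothetical table containing one duplicated (name, platform) pair
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (name TEXT, platform TEXT, meta_score INTEGER)")
conn.executemany(
    "INSERT INTO demo VALUES (?, ?, ?)",
    [
        ("Game A", "Switch", 77),
        ("Game A", "Switch", 76),
        ("Game A", "PC", 80),
        ("Game B", "PC", 65),
    ],
)

# Keep only the earliest row (lowest rowid) of each (name, platform) pair.
# Keeping the first row is an arbitrary placeholder rule, not the author's.
conn.execute(
    """
    DELETE FROM demo
    WHERE rowid NOT IN (
        SELECT MIN(rowid) FROM demo GROUP BY name, platform
    )
    """
)
remaining = conn.execute(
    "SELECT name, platform, meta_score FROM demo ORDER BY rowid"
).fetchall()
conn.close()

print(remaining)
```

In practice the rule for choosing which row to keep would depend on checking the scores against the source, which this sketch does not attempt.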


#### Duplicate summaries
