# Video Games Review: Analysis

In this project, I explore [Metacritic's video game reviews](https://www.metacritic.com/browse/games/score/metascore/all/all/filtered?page=0) for games released between 1995 and 2021.

Please see the [README](https://github.com/henrylin03/video-games) for more information.


## Setup

In this section, I import the necessary libraries, set up the SQLite database, and load the input `.csv` files in a form that both SQLite and `pandas` can work with:


In [None]:
# import necessary libraries
import os
import pandas as pd
from sqlalchemy import create_engine

# creating SQLite database
engine = create_engine("sqlite://", echo=False)

# import CSV as DataFrames
INPUT_PATH = "./input"
meta_df = pd.read_csv(os.path.join(INPUT_PATH, "meta.csv"))
user_df = pd.read_csv(os.path.join(INPUT_PATH, "user.csv"))


In [None]:
meta_df.to_sql("meta", engine, if_exists="replace", index=False)
pd.read_sql_query("SELECT * FROM meta LIMIT 3", engine)


In [None]:
user_df.to_sql("user", engine, if_exists="replace", index=False)
pd.read_sql_query("SELECT * FROM user LIMIT 3", engine)


## Cleaning

Before analysis, the following steps should be taken:

1. Merge the two tables. If a game does not have a `meta_score` and/or `user_score`, fill the missing value with `NULL`.
2. Ensure correct data types: `release_date` should be a datetime, and the rank columns should be integers rather than floats.
3. Clean the `platform` column by removing the `"Platform: "` prefix.

_Missing values?
Duplicate values?_


### Merging Tables

Some games have a Metascore but no User Score due to insufficient reviews (and vice versa), so we need an outer join that keeps every game from both tables.

Here, we use the `DataFrame.merge()` method from `pandas`, as SQLite did not support `FULL OUTER JOIN` before version 3.39.0 _([SQLite Tutorial](https://www.sqlitetutorial.net/sqlite-full-outer-join/))_.


In [None]:
merged_df = meta_df.merge(user_df, how="outer", on=["name", "platform", "release_date"])

merged_df.head(3)
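
To see which side each row of an outer merge came from, `merge()` accepts an `indicator=True` argument that adds a `_merge` column. The sketch below uses two small toy frames (hypothetical values, not the real `meta.csv` and `user.csv`) with the same join keys as above:

```python
import pandas as pd

# toy stand-ins for meta_df and user_df; all values are made up for illustration
meta_toy = pd.DataFrame(
    {
        "name": ["Game A", "Game B"],
        "platform": ["PC", "Switch"],
        "release_date": ["January 1, 2000", "March 3, 2010"],
        "meta_score": [90, 85],
    }
)
user_toy = pd.DataFrame(
    {
        "name": ["Game B", "Game C"],
        "platform": ["Switch", "PC"],
        "release_date": ["March 3, 2010", "May 5, 2015"],
        "user_score": [8.1, 7.4],
    }
)

# outer merge keeps unmatched rows from both sides and fills the gaps with NaN
merged_toy = meta_toy.merge(
    user_toy, how="outer", on=["name", "platform", "release_date"], indicator=True
)

# count rows found only in meta_toy, only in user_toy, or in both
source_counts = merged_toy["_merge"].value_counts().to_dict()
print(source_counts)
```

Running the same count on `merged_df` would show how many games lack a Metascore or a User Score.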


In [None]:
if not merged_df[merged_df.summary_y.isna() & ~merged_df.summary_x.isna()].empty:
    print('There are rows with a "summary" in column "summary_x" and not "summary_y"')
if not merged_df[merged_df.summary_x.isna() & ~merged_df.summary_y.isna()].empty:
    print('There are rows with a "summary" in column "summary_y" and not "summary_x"')

# summary_y can be dropped safely as long as the second message above was not printed
merged_df = merged_df.drop(columns=["summary_y"]).rename(
    columns={"summary_x": "summary"}
)
merged_df.head(3)
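
If the check did report rows where only `summary_y` is filled, dropping that column would discard those summaries. One possible alternative, sketched here on toy data with hypothetical values, is to coalesce the two columns with `Series.combine_first()` before dropping them:

```python
import pandas as pd

# toy frame mimicking the two summary columns produced by the merge
toy = pd.DataFrame(
    {
        "name": ["Game A", "Game B", "Game C"],
        "summary_x": ["From meta", "Same text", None],
        "summary_y": [None, "Same text", "From user"],
    }
)

# take summary_x where present, otherwise fall back to summary_y
toy["summary"] = toy["summary_x"].combine_first(toy["summary_y"])
toy = toy.drop(columns=["summary_x", "summary_y"])
print(toy)
```

This keeps a summary for every game that has one in either source table.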


### Ensuring correct data types


#### Datetime

SQLite's date functions do not parse month names ([StackOverflow](https://stackoverflow.com/questions/1181123/date-formatting-from-sqlite-query)) such as those in the `release_date` column, so we use [`pandas.to_datetime()`](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html) instead:


In [None]:
merged_df["release_date"] = pd.to_datetime(merged_df.release_date, format="%B %d, %Y")
merged_df.head(3)
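
As a small illustration of the format string, `%B` matches the full month name, `%d` the day, and `%Y` the four-digit year. The sketch below uses made-up date strings; the `"TBA"` entry and the `errors="coerce"` argument are assumptions added to show how an unparseable value could be turned into `NaT` rather than raising an error:

```python
import pandas as pd

# hypothetical date strings in the same "Month D, YYYY" layout as release_date
dates = pd.Series(["November 23, 1998", "March 3, 2017", "TBA"])

# errors="coerce" turns values that do not match the format into NaT
parsed = pd.to_datetime(dates, format="%B %d, %Y", errors="coerce")
print(parsed)
```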


#### Ranks


In [None]:
for col in [c for c in merged_df.columns if c.endswith("_rank")]:
    merged_df[col] = merged_df[col].astype("Int64")  # Int64 type handles null values

merged_df.dtypes  # expected output: `meta_rank` and `user_rank` columns are of type `Int64`
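
The reason for the capitalised `"Int64"` can be shown on a toy series (hypothetical values): a column with missing entries is stored as `float64`, the NumPy `int64` type cannot represent the missing value, and the nullable `Int64` extension type can.

```python
import pandas as pd

# a rank column with a missing value is read in as float64
ranks = pd.Series([1.0, 2.0, None], name="meta_rank")
print(ranks.dtype)

# plain NumPy int64 cannot hold NaN, so this conversion fails
try:
    ranks.astype("int64")
    plain_int_failed = False
except ValueError:
    plain_int_failed = True

# the nullable Int64 type keeps the gap as <NA>
as_nullable = ranks.astype("Int64")
print(as_nullable)
```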


### Removing string prefix in `platform` column
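
One way this step could be done is sketched below on toy values, assuming the prefix is exactly `"Platform: "` as described in the cleaning steps. The anchored regular expression and the trailing `.str.strip()` are illustrative choices, not necessarily the approach used in this project:

```python
import pandas as pd

# hypothetical platform values; the last one has no prefix and should be left unchanged
platforms = pd.Series(["Platform: PC", "Platform: Nintendo 64", "PlayStation 4"])

# remove the prefix only when it appears at the start of the string
cleaned = platforms.str.replace(r"^Platform:\s*", "", regex=True).str.strip()
print(cleaned.tolist())
```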
