# EDA

**Objective:** Load each dataset, inspect metadata (row count, data types,
key columns), and confirm unique IDs.

In [None]:
# Import modules
from pathlib import Path
import pandas as pd
from utils import print_header, display_df_info

In [3]:
# Setup paths
data_dir = Path.cwd().parent / 'data'

print_header('Datasets')
for f in data_dir.iterdir():
    print(f.name)

Datasets
--------
categories.csv
descriptions.csv
games.csv
genres.csv
promotional.csv
reviews.csv
steam-200k.csv
steamspy_insights.csv
tags.csv


Within each dataset, I want to confirm a few things:

- Column Names
- Data Types
- Null Counts
- Row Count
- Unique IDs

The `display_df_info()` function will take care of the first 3.
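
The `utils` module isn't shown in this notebook, so as a rough illustration only, here is a minimal sketch of how a per-column summary like the one printed below could be built. The name `summarize_columns` and the toy frame are hypothetical and are not the actual `display_df_info()` implementation.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for the summary table (not the real utils code)."""
    null_count = df.isna().sum()  # nulls per column
    total = len(df)               # row count, repeated on every line
    return pd.DataFrame({
        'Dtype': df.dtypes.astype(str),
        'Null Count': null_count,
        'Total': total,
        '% Null': (null_count / total * 100).round(1) if total else 0.0,
    })


# Toy data for illustration only.
example = pd.DataFrame({'app_id': [10, 20, 30], 'name': ['a', None, 'c']})
summary = summarize_columns(example)
print(summary)
```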

In [47]:
id_cols: dict[str, list[str]] = {}

## Explore `steam-200k`

From the Kaggle page, the column names are:

["user_id", "game_title", "behavior_name", "hours", "extra"]

In [49]:
path = data_dir / 'steam-200k.csv'
columns = ['user_id', 'game_title', 'behavior_name', 'hours', 'extra']

steam_200k = pd.read_csv(path, names=columns)
display_df_info(steam_200k, 'Steam 200k')

### Steam 200k

Unnamed: 0,Dtype,Null Count,Total,% Null
user_id,int64,0,200000,0.0%
game_title,object,0,200000,0.0%
behavior_name,object,0,200000,0.0%
hours,float64,0,200000,0.0%
extra,int64,0,200000,0.0%


Unnamed: 0,user_id,game_title,behavior_name,hours,extra
0,151603712,The Elder Scrolls V Skyrim,purchase,1.0,0
1,151603712,The Elder Scrolls V Skyrim,play,273.0,0
2,151603712,Fallout 4,purchase,1.0,0
3,151603712,Fallout 4,play,87.0,0
4,151603712,Spore,purchase,1.0,0


Based on this preview, the first 3 columns (`user_id`, `game_title`, and
`behavior_name`) should together form the unique ID.

In [50]:
# Confirm Unique IDs
id_cols['steam_200k'] = steam_200k.columns.to_list()[:3]
n_rows = len(steam_200k)
n_unique_ids = len(steam_200k.drop_duplicates(subset=id_cols['steam_200k']))

print(f'Dataset Length: {n_rows:,}')
print(f'# Unique IDs:   {n_unique_ids:,}')
assert n_rows == n_unique_ids, f"Invalid Unique ID(s): {id_cols['steam_200k']}"

Dataset Length: 200,000
# Unique IDs:   199,281


AssertionError: Invalid Unique ID(s): ['user_id', 'game_title', 'behavior_name']

In [None]:
steam_200k.sort_values(by=id_cols['steam_200k'],
                       ignore_index=True,
                       inplace=True)

In [53]:
duplicates = steam_200k.loc[steam_200k.duplicated(subset=id_cols['steam_200k'],
                                                  keep=False)]
duplicates

Unnamed: 0,user_id,game_title,behavior_name,hours,extra
961,561758,Sid Meier's Civilization IV,purchase,1.0,0
962,561758,Sid Meier's Civilization IV,purchase,1.0,0
963,561758,Sid Meier's Civilization IV Beyond the Sword,purchase,1.0,0
964,561758,Sid Meier's Civilization IV Beyond the Sword,purchase,1.0,0
965,561758,Sid Meier's Civilization IV Colonization,purchase,1.0,0
...,...,...,...,...,...
194580,267053376,Grand Theft Auto San Andreas,purchase,1.0,0
199119,302237901,Grand Theft Auto San Andreas,purchase,1.0,0
199120,302237901,Grand Theft Auto San Andreas,purchase,1.0,0
199664,305835588,Grand Theft Auto San Andreas,purchase,1.0,0


So it appears that if a user purchases the same game multiple times, each
purchase appears as its own row.

Let's drop the duplicates so that each ID combination appears only once.
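
Before dropping, it may be worth checking whether rows that repeat on the ID columns are also identical in the remaining columns, since dropping on the ID subset keeps only the first row of each group. The sketch below uses a small toy frame with illustrative values (not rows from the real file) to separate exact duplicates from conflicting ones.

```python
import pandas as pd

id_columns = ['user_id', 'game_title', 'behavior_name']

# Toy rows mimicking the structure of steam-200k (values are illustrative).
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 2],
    'game_title': ['Spore', 'Spore', 'Fallout 4', 'Fallout 4'],
    'behavior_name': ['purchase', 'purchase', 'play', 'play'],
    'hours': [1.0, 1.0, 3.0, 5.5],
    'extra': [0, 0, 0, 0],
})

# Rows that repeat on the ID columns...
dup_on_ids = toy.duplicated(subset=id_columns, keep=False)
# ...versus rows that repeat on every column.
dup_on_all = toy.duplicated(keep=False)

# ID duplicates whose other columns differ would lose data if dropped.
conflicting = toy[dup_on_ids & ~dup_on_all]
print(conflicting)
```

If `conflicting` comes back empty on the real data, dropping on the ID columns discards nothing but exact repeats.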

In [54]:
steam_200k.drop_duplicates(subset=id_cols['steam_200k'],
                           ignore_index=True,
                           inplace=True)
display_df_info(steam_200k)

Unnamed: 0,Dtype,Null Count,Total,% Null
user_id,int64,0,199281,0.0%
game_title,object,0,199281,0.0%
behavior_name,object,0,199281,0.0%
hours,float64,0,199281,0.0%
extra,int64,0,199281,0.0%


Unnamed: 0,user_id,game_title,behavior_name,hours,extra
0,5250,Alien Swarm,play,4.9,0
1,5250,Alien Swarm,purchase,1.0,0
2,5250,Cities Skylines,play,144.0,0
3,5250,Cities Skylines,purchase,1.0,0
4,5250,Counter-Strike,purchase,1.0,0


In [55]:
# Confirm Unique IDs
n_rows = len(steam_200k)
n_unique_ids = len(steam_200k.drop_duplicates(subset=id_cols['steam_200k']))

print(f'Dataset Length: {n_rows:,}')
print(f'# Unique IDs:   {n_unique_ids:,}')
assert n_rows == n_unique_ids, f"Invalid Unique ID(s): {id_cols['steam_200k']}"

Dataset Length: 199,281
# Unique IDs:   199,281
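
The same uniqueness check is repeated for every dataset in this notebook. As a sketch only, it could be wrapped in a small helper; `assert_unique_ids` is a hypothetical name and is not part of `utils`.

```python
import pandas as pd


def assert_unique_ids(df: pd.DataFrame, id_columns: list[str]) -> None:
    """Print row and unique-key counts, then fail if the key is not unique."""
    n_rows = len(df)
    n_unique_ids = len(df.drop_duplicates(subset=id_columns))
    print(f'Dataset Length: {n_rows:,}')
    print(f'# Unique IDs:   {n_unique_ids:,}')
    assert n_rows == n_unique_ids, f'Invalid Unique ID(s): {id_columns}'


# Toy usage: a frame whose key is unique passes silently after printing.
assert_unique_ids(pd.DataFrame({'app_id': [10, 20, 30]}), ['app_id'])
```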


## Explore Steam Insights

### Games

In [46]:
path = data_dir / 'games.csv'
games = pd.read_csv(path, escapechar='\\')
display_df_info(games, 'Games')

### Games

Unnamed: 0,Dtype,Null Count,Total,% Null
app_id,int64,0,140082,0.0%
name,object,0,140082,0.0%
release_date,object,0,140082,0.0%
is_free,int64,0,140082,0.0%
price_overview,object,0,140082,0.0%
languages,object,0,140082,0.0%
type,object,0,140082,0.0%


Unnamed: 0,app_id,name,release_date,is_free,price_overview,languages,type
0,10,Counter-Strike,2000-11-01,0,"{""final"": 819, ""initial"": 819, ""currency"": ""EU...","English<strong>*</strong>, French<strong>*</st...",game
1,20,Team Fortress Classic,1999-04-01,0,"{""final"": 499, ""initial"": 499, ""currency"": ""EU...","English, French, German, Italian, Spanish - Sp...",game
2,30,Day of Defeat,2003-05-01,0,"{""final"": 499, ""initial"": 499, ""currency"": ""EU...","English, French, German, Italian, Spanish - Spain",game
3,40,Deathmatch Classic,2001-06-01,0,"{""final"": 499, ""initial"": 499, ""currency"": ""EU...","English, French, German, Italian, Spanish - Sp...",game
4,50,Half-Life: Opposing Force,1999-11-01,0,"{""final"": 499, ""initial"": 499, ""currency"": ""EU...","English, French, German, Korean",game


In [56]:
# Confirm unique IDs
id_cols['games'] = ['app_id']
n_rows = len(games)
n_unique_ids = len(games.drop_duplicates(subset=id_cols['games']))

print(f'Dataset Length: {n_rows:,}')
print(f'# Unique IDs:   {n_unique_ids:,}')
assert n_rows == n_unique_ids, f"Invalid Unique ID(s): {id_cols['games']}"

Dataset Length: 140,082
# Unique IDs:   140,082


### Genres

In [57]:
path = data_dir / 'genres.csv'
genres = pd.read_csv(path, escapechar='\\')
display_df_info(genres, 'Genres')

### Genres

Unnamed: 0,Dtype,Null Count,Total,% Null
app_id,int64,0,353339,0.0%
genre,object,0,353339,0.0%


Unnamed: 0,app_id,genre
0,10,Action
1,20,Action
2,30,Action
3,40,Action
4,50,Action


I'm guessing that an app can have multiple genres, but I can't confirm this
from the first 5 rows. Let's try assuming that `app_id` is the key ID:

In [58]:
# Confirm unique IDs
id_cols['genres'] = ['app_id']
n_rows = len(genres)
n_unique_ids = len(genres.drop_duplicates(subset=id_cols['genres']))

print(f'Dataset Length: {n_rows:,}')
print(f'# Unique IDs:   {n_unique_ids:,}')
assert n_rows == n_unique_ids, f"Invalid Unique ID(s): {id_cols['genres']}"

Dataset Length: 353,339
# Unique IDs:   122,458


AssertionError: Invalid Unique ID(s): ['app_id']

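That's about what I figured: 353,339 rows across 122,458 unique `app_id`
values means many apps have more than one genre. Let's make sure there are no
duplicate `app_id`/`genre` pairs: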

In [59]:
# Confirm unique IDs
id_cols['genres'] = ['app_id', 'genre']
n_rows = len(genres)
n_unique_ids = len(genres.drop_duplicates(subset=id_cols['genres']))

print(f'Dataset Length: {n_rows:,}')
print(f'# Unique IDs:   {n_unique_ids:,}')
assert n_rows == n_unique_ids, f"Invalid Unique ID(s): {id_cols['genres']}"

Dataset Length: 353,339
# Unique IDs:   353,339
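
To see directly how many genres each app carries, the rows can be grouped by `app_id`. This is a minimal sketch on a toy frame with made-up values; the same `groupby` call should apply to the real `genres` frame.

```python
import pandas as pd

# Toy data for illustration only.
toy_genres = pd.DataFrame({
    'app_id': [10, 10, 20, 30, 30, 30],
    'genre': ['Action', 'Indie', 'Action', 'RPG', 'Indie', 'Strategy'],
})

# Number of distinct genres attached to each app.
genres_per_app = toy_genres.groupby('app_id')['genre'].nunique()
print(genres_per_app)
print(f'Max genres on one app: {genres_per_app.max()}')
```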


### Categories

In [60]:
path = data_dir / 'categories.csv'
categories = pd.read_csv(path, escapechar='\\')
display_df_info(categories, 'Categories')

### Categories

Unnamed: 0,Dtype,Null Count,Total,% Null
app_id,int64,0,522582,0.0%
category,object,0,522582,0.0%


Unnamed: 0,app_id,category
0,10,Family Sharing
1,10,Multi-player
2,10,Online PvP
3,10,PvP
4,10,Shared/Split Screen PvP


In [61]:
# Confirm unique IDs
id_cols['categories'] = ['app_id', 'category']
n_rows = len(categories)
n_unique_ids = len(categories.drop_duplicates(subset=id_cols['categories']))

print(f'Dataset Length: {n_rows:,}')
print(f'# Unique IDs:   {n_unique_ids:,}')
assert n_rows == n_unique_ids, f"Invalid Unique ID(s): {id_cols['categories']}"

Dataset Length: 522,582
# Unique IDs:   522,582


### Tags

In [62]:
filename = 'tags.csv'
path = data_dir / filename
tags = pd.read_csv(path, escapechar='\\')
display_df_info(tags, 'Tags')

### Tags

Unnamed: 0,Dtype,Null Count,Total,% Null
app_id,int64,0,1744632,0.0%
tag,object,0,1744632,0.0%


Unnamed: 0,app_id,tag
0,10,1980s
1,10,1990's
2,10,Action
3,10,Assassin
4,10,Classic


In [63]:
# Confirm unique IDs
id_cols['tags'] = ['app_id', 'tag']
n_rows = len(tags)
n_unique_ids = len(tags.drop_duplicates(subset=id_cols['tags']))

print(f'Dataset Length: {n_rows:,}')
print(f'# Unique IDs:   {n_unique_ids:,}')
assert n_rows == n_unique_ids, f"Invalid Unique ID(s): {id_cols['tags']}"

Dataset Length: 1,744,632
# Unique IDs:   1,744,632


### Descriptions

In [64]:
filename = 'descriptions.csv'
path = data_dir / filename
descriptions = pd.read_csv(path, escapechar='\\')
display_df_info(descriptions, 'Descriptions')

### Descriptions

Unnamed: 0,Dtype,Null Count,Total,% Null
app_id,int64,0,140082,0.0%
summary,object,0,140082,0.0%
extensive,object,0,140082,0.0%
about,object,0,140082,0.0%


Unnamed: 0,app_id,summary,extensive,about
0,10,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...
1,20,One of the most popular online action games of...,One of the most popular online action games of...,One of the most popular online action games of...
2,30,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...
3,40,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...
4,50,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...


For a game description, I would expect each game to have only one row.
The row count of 140,082 matches the number of rows in the `games` dataset.

Let's confirm:

In [65]:
# Confirm unique IDs
id_cols['descriptions'] = ['app_id']
n_rows = len(descriptions)
n_unique_ids = len(descriptions.drop_duplicates(subset=id_cols['descriptions']))

print(f'Dataset Length: {n_rows:,}')
print(f'# Unique IDs:   {n_unique_ids:,}')
assert n_rows == n_unique_ids, f"Invalid Unique ID(s): {id_cols['descriptions']}"

Dataset Length: 140,082
# Unique IDs:   140,082


### Reviews

In [66]:
filename = 'reviews.csv'
path = data_dir / filename
reviews = pd.read_csv(path, escapechar='\\', low_memory=False)
display_df_info(reviews, 'Reviews')

### Reviews

Unnamed: 0,Dtype,Null Count,Total,% Null
app_id,int64,0,140082,0.0%
review_score,object,0,140082,0.0%
review_score_description,object,0,140082,0.0%
positive,object,0,140082,0.0%
negative,object,0,140082,0.0%
total,object,0,140082,0.0%
metacritic_score,object,0,140082,0.0%
reviews,object,0,140082,0.0%
recommendations,object,0,140082,0.0%
steamspy_user_score,object,0,140082,0.0%


Unnamed: 0,app_id,review_score,review_score_description,positive,negative,total,metacritic_score,reviews,recommendations,steamspy_user_score,steamspy_score_rank,steamspy_positive,steamspy_negative
0,10,9,Overwhelmingly Positive,235403,6207,241610,88,N,153259,0,N,235397,6207
1,20,8,Very Positive,7315,1094,8409,N,N,6268,0,N,7314,1092
2,30,8,Very Positive,6249,672,6921,79,N,4146,0,N,6246,672
3,40,8,Very Positive,2542,524,3066,N,N,2218,0,N,2541,525
4,50,9,Overwhelmingly Positive,22263,1111,23374,N,N,20144,0,N,22260,1112


In [67]:
# Confirm unique IDs
id_cols['reviews'] = ['app_id']
n_rows = len(reviews)
n_unique_ids = len(reviews.drop_duplicates(subset=id_cols['reviews']))

print(f'Dataset Length: {n_rows:,}')
print(f'# Unique IDs:   {n_unique_ids:,}')
assert n_rows == n_unique_ids, f"Invalid Unique ID(s): {id_cols['reviews']}"

Dataset Length: 140,082
# Unique IDs:   140,082
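
One thing the summary above hints at: every column except `app_id` is read as `object` with zero nulls, while the preview shows `N` in columns like `metacritic_score`. Assuming `N` is a placeholder for a missing value (an assumption, not something confirmed by the dataset documentation here), a sketch of converting such a column to numeric could look like this. The toy frame is illustrative only.

```python
import pandas as pd

# Toy data: 'N' stands in for a missing score (assumed meaning).
toy_reviews = pd.DataFrame({
    'app_id': [10, 20, 30],
    'metacritic_score': ['88', 'N', '79'],
})

# Anything that cannot be parsed as a number becomes NaN.
numeric_scores = pd.to_numeric(toy_reviews['metacritic_score'], errors='coerce')
toy_reviews['metacritic_score'] = numeric_scores

print(toy_reviews.dtypes)
print(toy_reviews['metacritic_score'].isna().sum())
```

Note that `errors='coerce'` also hides any other unparseable values, so it is worth counting the distinct non-numeric strings first on real data.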


### Promotional

In [68]:
filename = 'promotional.csv'
path = data_dir / filename
promotional = pd.read_csv(path, escapechar='\\')
display_df_info(promotional, 'Promotional')

### Promotional

Unnamed: 0,Dtype,Null Count,Total,% Null
app_id,int64,0,140082,0.0%
header_image,object,0,140082,0.0%
background_image,object,0,140082,0.0%
screenshots,object,0,140082,0.0%
movies,object,0,140082,0.0%


Unnamed: 0,app_id,header_image,background_image,screenshots,movies
0,10,https://shared.akamai.steamstatic.com/store_it...,https://shared.akamai.steamstatic.com/store_it...,"[{""id"": 0, ""path_full"": ""https://shared.akamai...",N
1,20,https://shared.akamai.steamstatic.com/store_it...,https://shared.akamai.steamstatic.com/store_it...,"[{""id"": 0, ""path_full"": ""https://shared.akamai...",N
2,30,https://shared.akamai.steamstatic.com/store_it...,https://shared.akamai.steamstatic.com/store_it...,"[{""id"": 0, ""path_full"": ""https://shared.akamai...",N
3,40,https://shared.akamai.steamstatic.com/store_it...,https://shared.akamai.steamstatic.com/store_it...,"[{""id"": 0, ""path_full"": ""https://shared.akamai...",N
4,50,https://shared.akamai.steamstatic.com/store_it...,https://shared.akamai.steamstatic.com/store_it...,"[{""id"": 0, ""path_full"": ""https://shared.akamai...",N


In [69]:
# Confirm unique IDs
id_cols['promotional'] = ['app_id']
n_rows = len(promotional)
n_unique_ids = len(promotional.drop_duplicates(subset=id_cols['promotional']))

print(f'Dataset Length: {n_rows:,}')
print(f'# Unique IDs:   {n_unique_ids:,}')
assert n_rows == n_unique_ids, f"Invalid Unique ID(s): {id_cols['promotional']}"

Dataset Length: 140,082
# Unique IDs:   140,082


### SteamSpy Insights

In [70]:
filename = 'steamspy_insights.csv'
path = data_dir / filename
steamspy_insights = pd.read_csv(path, escapechar='\\')
display_df_info(steamspy_insights, 'SteamSpy Insights')

### SteamSpy Insights

Unnamed: 0,Dtype,Null Count,Total,% Null
app_id,int64,0,140077,0.0%
developer,object,3,140077,0.0%
publisher,object,35,140077,0.0%
owners_range,object,0,140077,0.0%
concurrent_users_yesterday,int64,0,140077,0.0%
playtime_average_forever,int64,0,140077,0.0%
playtime_average_2weeks,int64,0,140077,0.0%
playtime_median_forever,int64,0,140077,0.0%
playtime_median_2weeks,int64,0,140077,0.0%
price,object,0,140077,0.0%


Unnamed: 0,app_id,developer,publisher,owners_range,concurrent_users_yesterday,playtime_average_forever,playtime_average_2weeks,playtime_median_forever,playtime_median_2weeks,price,initial_price,discount,languages,genres
0,10,Valve,Valve,"10,000,000 .. 20,000,000",11457,0,0,0,0,999,999,0,"English, French, German, Italian, Spanish - Sp...",Action
1,20,Valve,Valve,"5,000,000 .. 10,000,000",52,0,0,0,0,499,499,0,"English, French, German, Italian, Spanish - Sp...",Action
2,30,Valve,Valve,"5,000,000 .. 10,000,000",82,0,0,0,0,499,499,0,"English, French, German, Italian, Spanish - Spain",Action
3,40,Valve,Valve,"5,000,000 .. 10,000,000",6,0,0,0,0,499,499,0,"English, French, German, Italian, Spanish - Sp...",Action
4,50,Gearbox Software,Valve,"2,000,000 .. 5,000,000",99,0,0,0,0,499,499,0,"English, French, German, Korean",Action


In [71]:
# Confirm unique IDs
id_cols['steamspy_insights'] = ['app_id']
n_rows = len(steamspy_insights)
n_unique_ids = steamspy_insights \
               .drop_duplicates(subset=id_cols['steamspy_insights']) \
               .shape[0]

print(f'Dataset Length: {n_rows:,}')
print(f'# Unique IDs:   {n_unique_ids:,}')
assert n_rows == n_unique_ids, \
    f"Invalid Unique ID(s): {id_cols['steamspy_insights']}"

Dataset Length: 140,077
# Unique IDs:   140,077
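
SteamSpy Insights has 140,077 rows, five fewer than the 140,082 in `games`. A sketch of how the missing `app_id` values could be located is a left merge with `indicator=True`; the toy frames below are illustrative and stand in for `games` and `steamspy_insights`.

```python
import pandas as pd

# Toy stand-ins for the two datasets.
toy_games = pd.DataFrame({'app_id': [10, 20, 30, 40]})
toy_steamspy = pd.DataFrame({'app_id': [10, 20, 40]})

# indicator=True adds a '_merge' column marking where each row was found.
merged = toy_games.merge(toy_steamspy, on='app_id', how='left', indicator=True)
missing = merged.loc[merged['_merge'] == 'left_only', 'app_id']
print(missing.to_list())
```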


## Dataset Map

Most datasets have `app_id` as their key identifier:

- Descriptions
- Games
- Promotional
- Reviews
- SteamSpy Insights

However, 3 datasets each represent a many-to-many relationship. In addition to
`app_id`, each of these datasets has a second ID column:

- **Categories**
  - ID Columns: `app_id`, `category`
  - A game can have many *categories*, and a category can describe many *games*.
- **Genres**
  - ID Columns: `app_id`, `genre`
  - A game can have many *genres*, and a genre can describe many *games*.
- **Tags**
  - ID Columns: `app_id`, `tag`
  - A game can have many *tags*, and a tag can describe many *games*.
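
As a sketch of how these link tables could be used, the example below joins a toy `games` frame to a toy `genres` frame on `app_id`. The values are illustrative, and `validate='one_to_many'` is included so that pandas raises if `app_id` is unexpectedly duplicated on the games side.

```python
import pandas as pd

# Toy data for illustration only.
toy_games = pd.DataFrame({
    'app_id': [10, 20],
    'name': ['Counter-Strike', 'Team Fortress Classic'],
})
toy_genres = pd.DataFrame({
    'app_id': [10, 10, 20],
    'genre': ['Action', 'Indie', 'Action'],
})

# One game row fans out to one row per genre.
games_with_genres = toy_games.merge(
    toy_genres, on='app_id', how='left', validate='one_to_many'
)
print(games_with_genres)
```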

The outlying dataset is the **Steam 200k** dataset. This one has 3 ID columns:
`user_id`, `game_title`, and `behavior_name`. This is not an ideal setup for
working with the other datasets, as it identifies games by title rather than
by `app_id` and so doesn't share an ID column with them.

To correct this, two steps will need to be taken:

1. Split the dataset into `playtime` and `purchases`, filtering for each type
   of behavior.
2. Clean the `game_title` column to match the values in the `name` column of
   the `games` dataset.
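
The first step can be sketched as two boolean filters on `behavior_name`. The toy frame is illustrative; in the preview earlier, `hours` is 1.0 on purchase rows, so this sketch drops it from `purchases` on the assumption that it carries no information there.

```python
import pandas as pd

# Toy rows mimicking steam-200k (values are illustrative).
toy = pd.DataFrame({
    'user_id': [1, 1, 2],
    'game_title': ['Spore', 'Spore', 'Fallout 4'],
    'behavior_name': ['purchase', 'play', 'purchase'],
    'hours': [1.0, 12.5, 1.0],
})

key = ['user_id', 'game_title']

# Keep hours only where it is meaningful (play rows).
playtime = toy.loc[toy['behavior_name'] == 'play', key + ['hours']].reset_index(drop=True)
purchases = toy.loc[toy['behavior_name'] == 'purchase', key].reset_index(drop=True)

print(playtime)
print(purchases)
```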

The first step is trivial, but the second will take a lot of data cleaning
and preparation.
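
As a possible starting point for the second step, both title columns could be reduced to a common normalized form before matching. Titles in the Steam 200k preview appear without punctuation (e.g. `Cities Skylines`), while `games` keeps it (e.g. `Half-Life: Opposing Force`). The function `normalize_title` below is a hypothetical sketch, not a complete matching strategy, and it deliberately ignores non-ASCII characters.

```python
import re


def normalize_title(title: str) -> str:
    """Lowercase a title and collapse anything that is not a-z or 0-9."""
    lowered = title.casefold()
    # Replace runs of punctuation/whitespace (and non-ASCII) with one space.
    collapsed = re.sub(r'[^a-z0-9]+', ' ', lowered)
    return collapsed.strip()


# A punctuated title and a hypothetical punctuation-free variant of it.
print(normalize_title('Half-Life: Opposing Force'))
print(normalize_title('Half-Life Opposing Force'))
```

Exact matches on the normalized form will not catch every case (subtitles, editions, renamed games), so some fuzzy matching or manual review is likely still needed.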