# RAWG — Business-focused EDA
This notebook performs a business- and research-centric exploratory data analysis of the RAWG cleaned dataset (`data/processed/rawg_cleaned.csv`).
Contract:
- Input: `data/processed/rawg_cleaned.csv` (filtered to 2024-11-11 -> 2025-11-11)
- Outputs: interactive Plotly charts for executive dashboards; engineered features saved when useful.
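
The date window in the contract can be checked before any analysis runs. The following is a minimal sketch, not part of the original pipeline: the helper name `check_release_window` is hypothetical, the bounds are assumed to be inclusive, and the toy rows are made up for illustration.

```python
import pandas as pd

# Window bounds taken from the contract; treating them as inclusive is an assumption.
WINDOW_START = pd.Timestamp("2024-11-11")
WINDOW_END = pd.Timestamp("2025-11-11")


def check_release_window(frame, start=WINDOW_START, end=WINDOW_END):
    """Return rows whose release_date is missing, unparseable, or outside [start, end]."""
    dates = pd.to_datetime(frame["release_date"], errors="coerce")
    outside = dates.isna() | (dates < start) | (dates > end)
    return frame.loc[outside]


# Toy frame for illustration only; these are not rows from the dataset.
toy = pd.DataFrame({
    "name": ["in_window", "too_early", "unparseable"],
    "release_date": ["2025-04-24", "2024-01-01", "not a date"],
})
violations = check_release_window(toy)
print(violations["name"].tolist())
```

On the real data, an empty result from `check_release_window(df)` would suggest the upstream filter was applied as described.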

In [1]:
import os
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from datetime import datetime
pd.set_option('display.max_columns', 100)

In [2]:
# Relative to the project root, as in the contract above; adjust if the notebook runs from another directory
PATH = os.path.join('data', 'processed', 'rawg_cleaned.csv')
assert os.path.exists(PATH), f'Missing {PATH} - run cleaners first'
df = pd.read_csv(PATH, dtype=str)
# normalize release_date
df['release_date'] = pd.to_datetime(df.get('release_date', ''), errors='coerce')
df['ratings'] = pd.to_numeric(df.get('ratings', None), errors='coerce')
df['metacritic'] = pd.to_numeric(df.get('metacritic', None), errors='coerce')
# shorten description for quick display; replace real newline characters with spaces
if 'description' in df.columns:
    df['description_short'] = df['description'].fillna('').str.slice(0, 200).str.replace('\n', ' ', regex=False)
df.shape


(2360, 12)

In [3]:
# Quick overview
display(df.head(5))
df.info()  # info() prints its report directly and returns None, so it is not wrapped in display()

Unnamed: 0,rawg_id,rawg_slug,name,release_date,genres,tags,ratings,platforms,esrb,metacritic,description,description_short
0,303576,vampire-the-masquerade-bloodlines-2,Vampire: The Masquerade – Bloodlines 2,2025-10-21,Action|RPG,Singleplayer|Для одного игрока|Экшен|Приключен...,3.85,PC|PlayStation 5|Xbox One|PlayStation 4|Xbox S...,Unknown,,"Sired in an act of vampire insurrection, your ...","Sired in an act of vampire insurrection, your ..."
1,58386,stalker-2,S.T.A.L.K.E.R. 2: Heart of Chornobyl,2024-11-20,Shooter|Adventure|Action|RPG,Singleplayer|Для одного игрока|Экшен|Steam Ach...,3.77,PC|Xbox Series S/X,Unknown,,S.T.A.L.K.E.R. 2 is a brand-new entry in the l...,S.T.A.L.K.E.R. 2 is a brand-new entry in the l...
2,292844,hollow-knight-silksong,Hollow Knight: Silksong,2025-09-04,Indie|Platformer|Adventure|Action,Singleplayer|Для одного игрока|Экшен|Приключен...,4.33,PC|PlayStation 5|Xbox One|PlayStation 4|Xbox S...,Unknown,,Hollow Knight: Silksong is the epic sequel to ...,Hollow Knight: Silksong is the epic sequel to ...
3,993875,marvel-rivals,Marvel Rivals,2024-12-06,Action,Экшен|Steam Achievements|Multiplayer|Full cont...,3.69,PC|PlayStation 5|Xbox Series S/X,Teen,,Marvel Rivals is a Super Hero Team-Based PVP S...,Marvel Rivals is a Super Hero Team-Based PVP S...
4,983210,clair-obscur-expedition-33,Clair Obscur: Expedition 33,2025-04-24,RPG,Singleplayer|Для одного игрока|Экшен|Приключен...,4.52,PC|PlayStation 5|Xbox Series S/X,Unknown,,"Once a year, the Paintress wakes and paints up...","Once a year, the Paintress wakes and paints up..."


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2360 entries, 0 to 2359
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   rawg_id            2360 non-null   object        
 1   rawg_slug          2360 non-null   object        
 2   name               2360 non-null   object        
 3   release_date       2360 non-null   datetime64[ns]
 4   genres             2342 non-null   object        
 5   tags               2271 non-null   object        
 6   ratings            2326 non-null   float64       
 7   platforms          2329 non-null   object        
 8   esrb               2360 non-null   object        
 9   metacritic         1 non-null      float64       
 10  description        2323 non-null   object        
 11  description_short  2360 non-null   object        
dtypes: datetime64[ns](1), float64(2), object(9)
memory usage: 221.4+ KB

## Feature engineering: explode list fields for analysis (genres, tags, platforms)
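Genres, tags, and platforms are stored as `|`-separated strings. We split each into a list and explode it into one row per (title, value) pair, which supports the genre- and platform-level aggregates used in dashboards (top genres, platform share, tag insights).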

In [4]:
for col in ['genres', 'tags', 'platforms']:
    if col not in df.columns:
        df[col] = ''
    df[col] = df[col].fillna('').astype(str)
    df[col + '_list'] = df[col].apply(lambda s: [x.strip() for x in s.split('|') if x.strip()])

# exploded genres for counting
df_genre = df.explode('genres_list')
df_genre['genres_list'] = df_genre['genres_list'].fillna('')
genre_counts = df_genre[df_genre['genres_list'] != '']['genres_list'].value_counts().reset_index()
genre_counts.columns = ['genre','count']
genre_counts.head()

Unnamed: 0,genre,count
0,Indie,1474
1,Adventure,1105
2,Action,912
3,Casual,862
4,Simulation,681
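
Because a title can carry several genres, the raw counts above sum to more than the number of titles. For a dashboard it may be easier to read each genre as a share of all titles. The sketch below is one way to do that. The helper name `label_share` and the column name `share_of_titles` are hypothetical, and the toy frame is illustrative rather than taken from the dataset.

```python
import pandas as pd


def label_share(frame, column, sep="|"):
    """Count titles per label and the fraction of all titles carrying that label."""
    labels = (
        frame[column]
        .fillna("")
        # A set drops repeated labels within one title, so each title counts once per label.
        .map(lambda s: sorted({part.strip() for part in s.split(sep) if part.strip()}))
        .explode()
        .dropna()  # titles with no labels explode to NaN
    )
    counts = labels.value_counts()
    out = counts.rename_axis("label").reset_index(name="count")
    # Denominator is all titles, including those with no label at all.
    out["share_of_titles"] = out["count"] / len(frame)
    return out


# Toy frame for illustration only.
toy = pd.DataFrame({"genres": ["Action|RPG", "Action", None, "RPG|Action|Indie"]})
shares = label_share(toy, "genres")
print(shares)
```

The same helper could be pointed at `tags` or `platforms`. Whether untagged titles belong in the denominator is a design choice; this sketch keeps them so that shares are comparable across fields.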


In [5]:
# Top genres (business view)
fig = px.bar(genre_counts.head(20), x='genre', y='count', title='Top genres (by number of releases)',
             text='count', template='plotly_white')
fig.update_layout(xaxis_tickangle=-45, height=480)
fig.show()

### Ratings distribution and summary statistics
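Examine the distribution of RAWG user ratings, and of Metacritic scores where available. In this extract `metacritic` has only 1 non-null value out of 2,360 rows, so its histogram carries little information.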

In [6]:
# Ratings histogram
fig = px.histogram(df, x='ratings', nbins=40, title='Distribution of RAWG user ratings', template='plotly_white')
fig.update_layout(height=420)
fig.show()

# Metacritic where present
if df['metacritic'].notna().any():
    fig2 = px.histogram(df, x='metacritic', nbins=40, title='Metacritic score distribution (where available)', template='plotly_white')
    fig2.update_layout(height=420)
    fig2.show()
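
The histograms above show the shape of the distributions but not the summary statistics this section's heading mentions. A minimal sketch of a coverage-and-quantiles summary follows. The helper name `rating_summary` and the choice of the 10th and 90th percentiles are assumptions made for illustration, and the toy values are not from the dataset.

```python
import pandas as pd


def rating_summary(frame, column="ratings"):
    """Coverage and basic distribution statistics for one numeric column."""
    values = pd.to_numeric(frame[column], errors="coerce")
    return pd.Series({
        "n_total": len(values),
        "n_rated": int(values.notna().sum()),
        "coverage": values.notna().mean(),  # fraction of rows with a usable value
        "mean": values.mean(),
        "median": values.median(),
        "p10": values.quantile(0.10),
        "p90": values.quantile(0.90),
    })


# Toy frame for illustration only; one missing and one unparseable value.
toy = pd.DataFrame({"ratings": ["4.0", "3.0", None, "5.0", "bad"]})
summary = rating_summary(toy)
print(summary)
```

Calling `rating_summary(df, 'metacritic')` on the real data would make the low Metacritic coverage explicit before anyone reads its histogram.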

### Platform share and release timeline
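A business view of platform distribution and of monthly release volume over the analysis window. Because a title can ship on several platforms, platform share is computed over title-platform pairs rather than over titles.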

In [7]:
# Platform share (explode platforms)
df_plat = df.explode('platforms_list')
df_plat['platforms_list'] = df_plat['platforms_list'].fillna('')
plat_counts = df_plat[df_plat['platforms_list'] != '']['platforms_list'].value_counts().reset_index()
plat_counts.columns = ['platform','count']
fig = px.pie(plat_counts.head(15), names='platform', values='count', title='Platform share (top platforms)', template='plotly_white')
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()

# Releases over time (by month)
df['release_month'] = df['release_date'].dt.to_period('M').dt.to_timestamp()
timeline = df.groupby('release_month').size().reset_index(name='count')
fig = px.line(timeline, x='release_month', y='count', title='Releases by month', markers=True, template='plotly_white')
fig.update_layout(xaxis_title='Month', height=420)
fig.show()
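
A `groupby` on `release_month` only returns months that have at least one release, so a month with no releases would be absent from the line chart rather than plotted as zero. The sketch below reindexes the counts onto a complete monthly range. The helper name `monthly_release_counts` is hypothetical, and the start and end dates in the toy call are illustrative.

```python
import pandas as pd


def monthly_release_counts(dates, start, end):
    """Releases per calendar month between start and end, with empty months as 0."""
    parsed = pd.to_datetime(pd.Series(dates), errors="coerce").dropna()
    months = parsed.dt.to_period("M")
    full_range = pd.period_range(start=start, end=end, freq="M")
    # Reindexing onto the full range inserts months that had no releases.
    counts = months.value_counts().reindex(full_range, fill_value=0)
    out = counts.rename_axis("release_month").reset_index(name="count")
    out["release_month"] = out["release_month"].dt.to_timestamp()
    return out


# Toy dates for illustration only: nothing is released in December 2024.
toy_dates = ["2024-11-20", "2024-11-25", "2025-01-05", None]
timeline_full = monthly_release_counts(toy_dates, "2024-11-11", "2025-01-31")
print(timeline_full)
```

The resulting frame has the same `release_month` and `count` columns as `timeline`, so it could be passed to `px.line` in the same way.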

### Top titles and short profiles
Create a compact executive table of the top-rated titles with key attributes. The ranking uses the `ratings` column only; titles without a rating sort last.

In [8]:
top_rated = df.sort_values(by='ratings', ascending=False).head(20)[['name','release_date','ratings','metacritic','platforms','genres']].copy()
top_rated['ratings'] = top_rated['ratings'].round(2)
top_rated.reset_index(drop=True, inplace=True)
top_rated

Unnamed: 0,name,release_date,ratings,metacritic,platforms,genres
0,Dispatch,2025-10-22,4.77,,PC|PlayStation 5,Adventure|Action|Casual|Strategy|Indie
1,System Shock 2: 25th Anniversary Remaster,2025-06-25,4.62,,PC,Adventure|Action|RPG
2,DRAGON BALL: Sparking! ZERO,2025-10-10,4.6,,PC|PlayStation 5|Xbox Series S/X|Nintendo Switch,Action|Fighting
3,Legacy of Kain™ Soul Reaver 1&2 Remastered,2024-12-10,4.57,,PC|PlayStation 5|Xbox One|PlayStation 4|Xbox S...,Platformer|Adventure|Action|Puzzle
4,Clair Obscur: Expedition 33,2025-04-24,4.52,,PC|PlayStation 5|Xbox Series S/X,RPG
5,Split Fiction,2025-03-06,4.44,,PC|PlayStation 5|Xbox Series S/X,Adventure|Action
6,"One Iced Latte With Your Breast Milk, Please! ☕",2025-03-05,4.43,,PC,Casual|Indie|Adventure|Simulation
7,Bionic Bay,2025-04-16,4.43,,PC,Racing|Adventure|Action|Casual|Indie|Platformer
8,Hades II,2025-09-25,4.4,,PC|Nintendo Switch,Indie|Adventure|Action|RPG
9,The Rise of the Golden Idol,2024-11-12,4.38,,PC|PlayStation 5|Xbox One|Xbox Series S/X|Nint...,Indie|Adventure|Puzzle


## Save engineered features (optional)
Save the genre and platform tallies to `data/processed/` so that dashboards or Tableau can reuse them.

In [9]:
os.makedirs('data/processed', exist_ok=True)
genre_counts.to_csv('data/processed/rawg_genre_counts.csv', index=False)
plat_counts.to_csv('data/processed/rawg_platform_counts.csv', index=False)
print('Wrote rawg_genre_counts and rawg_platform_counts to data/processed')

Wrote rawg_genre_counts and rawg_platform_counts to data/processed
