# Notebook for the Game Dimension 

This notebook shows the Extraction and transformtion steps implemented for the Game dimension

## Required Imports

In [1]:
import pandas as pd
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as mno
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
%matplotlib inline

# Exploration:

## Loading the dataset for game dimension.

In [2]:
df_three = pd.read_json("../assets/game/playtime.json")

- The descripition below shows the initial state of the dataset.

In [3]:
df_three.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60393 entries, 0 to 60392
Data columns (total 13 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   Name                                 60393 non-null  object 
 1   Stats                                60393 non-null  object 
 2   steam_app_id                         33002 non-null  float64
 3   Release_date                         60393 non-null  object 
 4   Genres                               60393 non-null  object 
 5   Review_score                         60393 non-null  int64  
 6   Single-Player_MainStory_Average      30039 non-null  object 
 7   Single-Player_Main+Extras_Average    20458 non-null  object 
 8   SinglePlayer_All_PlayStyles_Polled   36506 non-null  object 
 9   SinglePlayer_All_PlayStyles_Average  36506 non-null  object 
 10  SinglePlayer_All_PlayStyles_Median   36506 non-null  object 
 11  SinglePlayer_All_PlayStyles_

### First Step: 
- We initially want to convert all nuumerical attributes/columns from object types to float types as this accomodates varios analytics processes.
- Hence below we have a script `extract` that converts `Release_date` column values from string of type `DD-MM-YYYY` to a string of `YYYY` format.

In [4]:
def extract(s):
    return s.split('-')[0]
df_three.Release_date = df_three.Release_date.astype(str).apply(lambda x : extract(x))

- The summary below indicates the changed `Release_date` column.

In [5]:
df_three.head()

Unnamed: 0,Name,Stats,steam_app_id,Release_date,Genres,Review_score,Single-Player_MainStory_Average,Single-Player_Main+Extras_Average,SinglePlayer_All_PlayStyles_Polled,SinglePlayer_All_PlayStyles_Average,SinglePlayer_All_PlayStyles_Median,SinglePlayer_All_PlayStyles_Rushed,SinglePlayer_All_PlayStyles_Leisure
0,Borderlands 3,{'Additional Content': {'Moxxi's Heist of the ...,397540.0,2019,"First-Person, Action, Shooter",76,23h 17m,47h 3m,1.4K,43h 24m,35h,25h 9m,268h 36m
1,Dying Light,{'Additional Content': {'The Following': {'Pol...,239140.0,2015,"Action, Adventure, Open World, Survival Horror",78,18h 10m,37h 10m,1.9K,32h 9m,27h 47m,18h 15m,117h 29m
2,Middle-Earth: Shadow of War,{'Additional Content': {'Blade of Galadriel': ...,356190.0,2017,"Action, Adventure",76,20h 47m,37h 1m,1.4K,38h 30m,34h,20h 7m,118h 7m
3,Counter-Strike: Global Offensive,"{'Multi-Player': {'Co-Op': {'Polled': '34', 'A...",730.0,2012,"First-Person, Shooter",74,,,,,,,
4,Grand Theft Auto IV,{'Additional Content': {'The Lost and Damned':...,5152.0,2008,"Third-Person, Action, Sandbox, Shooter",82,27h 37m,41h 24m,2K,37h 48m,32h,21h 27m,152h 27m


- Next we convert the `Release_date` column to integer from object type.

In [6]:
# Convert Release_date column from object to integer.
df_three['Release_date'] = df_three['Release_date'].astype('int')
df_three.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60393 entries, 0 to 60392
Data columns (total 13 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   Name                                 60393 non-null  object 
 1   Stats                                60393 non-null  object 
 2   steam_app_id                         33002 non-null  float64
 3   Release_date                         60393 non-null  int64  
 4   Genres                               60393 non-null  object 
 5   Review_score                         60393 non-null  int64  
 6   Single-Player_MainStory_Average      30039 non-null  object 
 7   Single-Player_Main+Extras_Average    20458 non-null  object 
 8   SinglePlayer_All_PlayStyles_Polled   36506 non-null  object 
 9   SinglePlayer_All_PlayStyles_Average  36506 non-null  object 
 10  SinglePlayer_All_PlayStyles_Median   36506 non-null  object 
 11  SinglePlayer_All_PlayStyles_

- As per the current market trends and taking into consideration various other pop culture trends we decide to remove data related to any game that was rrelease beefore the yeaar `2000`.
- Moreover lack of relevant data for such games, made us finalise this decision.

In [7]:
# Remove all entries that have a Release_date(yeear) of less than 2000.
df_three = df_three[df_three['Release_date'] > 2000]

In [8]:
# Now we have reduced total entries from approx 60000 to 46000.
df_three.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 46355 entries, 0 to 60391
Data columns (total 13 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   Name                                 46355 non-null  object 
 1   Stats                                46355 non-null  object 
 2   steam_app_id                         31297 non-null  float64
 3   Release_date                         46355 non-null  int64  
 4   Genres                               46355 non-null  object 
 5   Review_score                         46355 non-null  int64  
 6   Single-Player_MainStory_Average      23248 non-null  object 
 7   Single-Player_Main+Extras_Average    16974 non-null  object 
 8   SinglePlayer_All_PlayStyles_Polled   28769 non-null  object 
 9   SinglePlayer_All_PlayStyles_Average  28769 non-null  object 
 10  SinglePlayer_All_PlayStyles_Median   28769 non-null  object 
 11  SinglePlayer_All_PlayStyles_

- In the next 5-6 steps we convert the tables(from type object to float):
- SinglePlayer_All_PlayStyles_Polled
- SinglePlayer_All_PlayStyles_Average
- SinglePlayer_All_PlayStyles_Median
- SinglePlayer_All_PlayStyles_Rushed
- SinglePlayer_All_PlayStyles_Leisure
> We take transformation steps where we are converting all time values from `hh mm` format to just minutes float type.
> Moreover, we also convert any units to actual digits. for example `1.4k` -> `1400`.

In [9]:
df_test = df_three

In [10]:
# Convert object to float for singleplayer gametime polled column
def o_to_f(x):
    t=str(x)
    if 'K' in t:
        return float(t.replace('K','')) * 1000
    else:
        return float(t)

df_test['SinglePlayer_All_PlayStyles_Polled'] = df_test['SinglePlayer_All_PlayStyles_Polled'].apply(lambda x: o_to_f(x))

In [11]:
# Convert object to float for singleplayer gametime Average column.
def o_to_f(x):
    t=str(x)
    if " " in t:
        t = t.split(" ")
        if len(t) == 2:
            h = float(t[0].replace("h", "")) * 60
            m = float(t[1].replace("m", ""))
            return h+m
        elif len(t) == 1:
            if "m" in t[1]:
                n = float(t[1].replace("m", ""))
                return n
            elif "h" in t[1]:
                n = float(t[1].replace("h", ""))
                return n
    elif "nan" not in t:
        if "m" in t:
            n = float(t.replace("m", ""))
            return n
        elif "h" in t:
            n = float(t.replace("h", ""))
            return n
    else:
        return float(t)
    
    
df_test['SinglePlayer_All_PlayStyles_Average'] = df_test['SinglePlayer_All_PlayStyles_Average'].apply(lambda x: o_to_f(x))
df_test_two = df_test


In [12]:
# Convert object to float for singleplayer gametime Median column.
def o_to_f(x):
    t=str(x)
    if " " in t:
        t = t.split(" ")
        if len(t) == 2:
            h = float(t[0].replace("h", "")) * 60
            m = float(t[1].replace("m", ""))
            return h+m
        elif len(t) == 1:
            if "m" in t[1]:
                n = float(t[1].replace("m", ""))
                return n
            elif "h" in t[1]:
                n = float(t[1].replace("h", ""))
                return n
    elif "nan" not in t:
        if "m" in t:
            n = float(t.replace("m", ""))
            return n
        elif "h" in t:
            n = float(t.replace("h", ""))
            return n
    else:
        return float(t)
    
    
df_test_two['SinglePlayer_All_PlayStyles_Median'] = df_test_two['SinglePlayer_All_PlayStyles_Median'].apply(lambda x: o_to_f(x))
df_test_three = df_test_two


In [13]:
# Convert object to float for singleplayer gametime Rushed column.
def o_to_f(x):
    t=str(x)
    if " " in t:
        t = t.split(" ")
        if len(t) == 2:
            h = float(t[0].replace("h", "")) * 60
            m = float(t[1].replace("m", ""))
            return h+m
        elif len(t) == 1:
            if "m" in t[1]:
                n = float(t[1].replace("m", ""))
                return n
            elif "h" in t[1]:
                n = float(t[1].replace("h", ""))
                return n
    elif "nan" not in t:
        if "m" in t:
            n = float(t.replace("m", ""))
            return n
        elif "h" in t:
            n = float(t.replace("h", ""))
            return n
    else:
        return float(t)
    
    
df_test_three['SinglePlayer_All_PlayStyles_Rushed'] = df_test_three['SinglePlayer_All_PlayStyles_Rushed'].apply(lambda x: o_to_f(x))
df_test_four = df_test_three


In [14]:
# Convert object to float for singleplayer gametime Leisure column.
def o_to_f(x):
    t=str(x)
    if " " in t:
        t = t.split(" ")
        if len(t) == 2:
            h = float(t[0].replace("h", "")) * 60
            m = float(t[1].replace("m", ""))
            return h+m
        elif len(t) == 1:
            if "m" in t[1]:
                n = float(t[1].replace("m", ""))
                return n
            elif "h" in t[1]:
                n = float(t[1].replace("h", ""))
                return n
    elif "nan" not in t:
        if "m" in t:
            n = float(t.replace("m", ""))
            return n
        elif "h" in t:
            n = float(t.replace("h", ""))
            return n
    else:
        return float(t)
    
    
df_test_four['SinglePlayer_All_PlayStyles_Leisure'] = df_test_four['SinglePlayer_All_PlayStyles_Leisure'].apply(lambda x: o_to_f(x))
df_test_five = df_test_four


In [15]:
df_test_five.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 46355 entries, 0 to 60391
Data columns (total 13 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   Name                                 46355 non-null  object 
 1   Stats                                46355 non-null  object 
 2   steam_app_id                         31297 non-null  float64
 3   Release_date                         46355 non-null  int64  
 4   Genres                               46355 non-null  object 
 5   Review_score                         46355 non-null  int64  
 6   Single-Player_MainStory_Average      23248 non-null  object 
 7   Single-Player_Main+Extras_Average    16974 non-null  object 
 8   SinglePlayer_All_PlayStyles_Polled   28769 non-null  float64
 9   SinglePlayer_All_PlayStyles_Average  28759 non-null  float64
 10  SinglePlayer_All_PlayStyles_Median   28761 non-null  float64
 11  SinglePlayer_All_PlayStyles_

- We drop any unwanted irrelevant attributes as below.

In [16]:
game = df_test_five.drop(['Stats', 'steam_app_id', 'Single-Player_MainStory_Average', 'Single-Player_Main+Extras_Average'], axis=1)

In [17]:
game.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 46355 entries, 0 to 60391
Data columns (total 9 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   Name                                 46355 non-null  object 
 1   Release_date                         46355 non-null  int64  
 2   Genres                               46355 non-null  object 
 3   Review_score                         46355 non-null  int64  
 4   SinglePlayer_All_PlayStyles_Polled   28769 non-null  float64
 5   SinglePlayer_All_PlayStyles_Average  28759 non-null  float64
 6   SinglePlayer_All_PlayStyles_Median   28761 non-null  float64
 7   SinglePlayer_All_PlayStyles_Rushed   28754 non-null  float64
 8   SinglePlayer_All_PlayStyles_Leisure  28760 non-null  float64
dtypes: float64(5), int64(2), object(2)
memory usage: 3.5+ MB


In [18]:
game.describe()

Unnamed: 0,Release_date,Review_score,SinglePlayer_All_PlayStyles_Polled,SinglePlayer_All_PlayStyles_Average,SinglePlayer_All_PlayStyles_Median,SinglePlayer_All_PlayStyles_Rushed,SinglePlayer_All_PlayStyles_Leisure
count,46355.0,46355.0,28769.0,28759.0,28761.0,28754.0,28760.0
mean,2014.806774,39.158775,59.935451,814.441323,460.368763,556.837831,1668.922705
std,5.453828,33.061904,294.731197,4844.219581,3988.656566,4059.025557,9181.751426
min,2001.0,0.0,1.0,1.0,1.0,1.0,1.0
25%,2012.0,0.0,2.0,73.0,20.0,61.0,93.0
50%,2016.0,50.0,4.0,254.0,110.0,199.0,349.5
75%,2019.0,70.0,19.0,685.0,390.0,506.0,1089.25
max,2024.0,100.0,9500.0,596970.0,596970.0,595485.0,598455.0


- We want to check if our data frame has any null values and we do that as follows:

In [19]:
game.isnull().sum()[1:9]

Release_date                               0
Genres                                     0
Review_score                               0
SinglePlayer_All_PlayStyles_Polled     17586
SinglePlayer_All_PlayStyles_Average    17596
SinglePlayer_All_PlayStyles_Median     17594
SinglePlayer_All_PlayStyles_Rushed     17601
SinglePlayer_All_PlayStyles_Leisure    17595
dtype: int64

In [20]:
game.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 46355 entries, 0 to 60391
Data columns (total 9 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   Name                                 46355 non-null  object 
 1   Release_date                         46355 non-null  int64  
 2   Genres                               46355 non-null  object 
 3   Review_score                         46355 non-null  int64  
 4   SinglePlayer_All_PlayStyles_Polled   28769 non-null  float64
 5   SinglePlayer_All_PlayStyles_Average  28759 non-null  float64
 6   SinglePlayer_All_PlayStyles_Median   28761 non-null  float64
 7   SinglePlayer_All_PlayStyles_Rushed   28754 non-null  float64
 8   SinglePlayer_All_PlayStyles_Leisure  28760 non-null  float64
dtypes: float64(5), int64(2), object(2)
memory usage: 3.5+ MB


# Filling Missing Data using Sci-Kit's Iterative Imputator

- We use the Iterative Imputator to help us fill in the missing values in all our columns, this iterator uses the random forest regressor to estimate and predict values by considering values from all other values of other attributes.
- For reference [Link to documentation](https://scikit-learn.org/stable/modules/impute.html)

In [None]:
imp_mean = IterativeImputer(estimator=RandomForestRegressor(), random_state=0, verbose=2)
features = ['Release_date','Review_score','SinglePlayer_All_PlayStyles_Polled','SinglePlayer_All_PlayStyles_Average','SinglePlayer_All_PlayStyles_Median','SinglePlayer_All_PlayStyles_Rushed','SinglePlayer_All_PlayStyles_Leisure']
imp_mean.fit(game[features])

[IterativeImputer] Completing matrix with shape (46355, 7)
[IterativeImputer] Ending imputation round 1/10, elapsed time 78.53
[IterativeImputer] Change: 2285.3671152977668, scaled tolerance: 598.455 
[IterativeImputer] Ending imputation round 2/10, elapsed time 163.13
[IterativeImputer] Change: 7915.790000000001, scaled tolerance: 598.455 
[IterativeImputer] Ending imputation round 3/10, elapsed time 246.89
[IterativeImputer] Change: 6426.099999999999, scaled tolerance: 598.455 
[IterativeImputer] Ending imputation round 4/10, elapsed time 330.26
[IterativeImputer] Change: 7005.969999999999, scaled tolerance: 598.455 
[IterativeImputer] Ending imputation round 5/10, elapsed time 412.65
[IterativeImputer] Change: 10730.01, scaled tolerance: 598.455 
[IterativeImputer] Ending imputation round 6/10, elapsed time 497.83
[IterativeImputer] Change: 11155.01, scaled tolerance: 598.455 
[IterativeImputer] Ending imputation round 7/10, elapsed time 583.04
[IterativeImputer] Change: 10652.72000

- We concatenate the new dataframe containing the filled missing values with the old dataframe.

In [None]:
gameTransformed = pd.concat([game,pd.DataFrame(imp_mean.transform(game[features]))], axis=1, join='inner')

- Summary below:

In [None]:
gameTransformed.head()

- Another look at our new concatenated dataframe.

In [None]:
gameTransformed.info()

- Drop the redundant and extra tables.

In [None]:
gameTransformed = gameTransformed.drop([0, 1], axis=1)

In [None]:
gameTransformed = gameTransformed.drop(['SinglePlayer_All_PlayStyles_Polled','SinglePlayer_All_PlayStyles_Average','SinglePlayer_All_PlayStyles_Median','SinglePlayer_All_PlayStyles_Rushed','SinglePlayer_All_PlayStyles_Leisure'], axis=1)
gameTransformed.rename(columns = {2:'SinglePlayer_All_PlayStyles_Polled', 3:'SinglePlayer_All_PlayStyles_Average', 4:'SinglePlayer_All_PlayStyles_Median', 5:'SinglePlayer_All_PlayStyles_Rushed', 6:'SinglePlayer_All_PlayStyles_Leisure'}, inplace = True)

- Our game dimension data finally transformed.

In [None]:
gameTransformed.info()

- Generate a csv file.

In [None]:
# Write to CSV.
gameTransformed.to_csv('game_dimension.csv')