# Numerical Variables Analysis

We will be examining the categorical variables of the `kaggle_steam` dataset (available in `data/01_raw/kaggle_steam.csv`) in this notebook.

After analysis, we will propose categorical features to adopt for our clustering model, KPrototypes.

The dataset has **11** categorical variables: `appid`, `name`, `release_date`, `english`, `developer`, `publisher`, `platforms`, `categories`, `genres`, `steamspy_tags` and `owners`.

We will be examining the following categorical variables to see if we can use them for clustering.

- `release_date`
- `english`
- `platforms`
- `categories`
- `genres`
- `steamspy_tags`
- `owners`

We omitted:

- `appid`, as it is an identifier and carries no meaning for clustering;
- `name`, as it is unique per record and therefore carries no grouping information;
- `developer`, as it clearly has many distinct labels, which would increase the dimensionality of our data too much and hurt our model's performance;
- `publisher`, for the same reason as `developer`.
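
The reasoning above (identifiers and per-record names carry no grouping information, and label-heavy columns inflate dimensionality) can be checked by counting distinct labels per column. The following is a minimal sketch on a toy frame; the `toy` data and the names `cardinality` and `unique_per_row` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for kaggle_steam; all values are invented for illustration.
toy = pd.DataFrame({
    "appid": [10, 20, 30, 40],
    "name": ["A", "B", "C", "D"],
    "developer": ["Dev1", "Dev2", "Dev3", "Dev1"],
    "english": [1, 1, 0, 1],
})

# Number of distinct labels per column, and that number as a share of all rows.
cardinality = toy.nunique().to_frame("n_unique")
cardinality["share_of_rows"] = cardinality["n_unique"] / len(toy)

# Columns where every row has its own label cannot group records together.
unique_per_row = cardinality.index[cardinality["share_of_rows"] == 1.0].tolist()

print(cardinality)
print(unique_per_row)
```

On the real data, the same call would be `kaggle_steam.nunique()`; a column whose share of rows is close to 1 is a likely candidate for omission.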

**Setting up**

In [1]:
%load_ext kedro.ipython
%load_ext autoreload
%matplotlib inline
%autoreload 2

In [2]:
import pandas as pd
import polars as pl
import numpy as np

from matplotlib import rc_context
import matplotlib.pyplot as plt
import seaborn as sb
from seaborn.objects import Plot
import seaborn.objects as so

from sklearn.preprocessing import maxabs_scale

import logging

from usg.utils import *

log = logging.getLogger(__name__)
log.setLevel(logging.INFO)
sb.set()

**All features of `kaggle_steam` and their dtypes**


In [3]:
kaggle_steam: pd.DataFrame = catalog.load('kaggle_steam')
kaggle_steam.dtypes.to_frame().T

| | appid | name | release_date | english | developer | publisher | platforms | required_age | categories | genres | steamspy_tags | achievements | positive_ratings | negative_ratings | average_playtime | median_playtime | owners | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | int64 | object | object | int64 | object | object | object | int64 | object | object | object | int64 | int64 | int64 | int64 | int64 | object | float64 |
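
The dtypes shown here give a first, rough split of the columns. As a minimal sketch, assuming a toy frame with a few of the same column names (the values are invented), `select_dtypes` can separate the numeric columns from the rest:

```python
import pandas as pd

# Toy stand-in with a handful of kaggle_steam column names; values are invented.
toy = pd.DataFrame({
    "appid": [10, 20],
    "name": ["A", "B"],
    "english": [1, 0],
    "price": [7.19, 3.99],
    "owners": ["10000-20000", "0-20000"],
})

# Columns stored as numbers versus everything else.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(other_cols)
```

A numeric dtype does not by itself make a column a numerical variable: `appid` and `english` are stored as `int64` but are treated as categorical in this notebook, so the split still needs a manual review.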


# Summary

In [4]:
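# Ratio of positive to negative ratings; a zero denominator yields inf (or NaN for 0 / 0).
ratings_ratio = kaggle_steam['positive_ratings'] / kaggle_steam['negative_ratings']
# Replace infinite ratios (games with no negative ratings) with 0; NaN values are left as they are.
ratings_ratio.loc[np.isinf(ratings_ratio)] = 0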
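
The edge cases of this ratio are easiest to see on a small example. The sketch below uses an invented toy frame to show what pandas returns when `negative_ratings` is zero, applies the same inf-to-zero rule as the cell above, and then computes a bounded alternative, the share of positive ratings. The alternative and the names `cleaned` and `positive_share` are assumptions for illustration, not what the notebook saves.

```python
import numpy as np
import pandas as pd

# Toy ratings; values are invented to cover the zero-denominator cases.
toy = pd.DataFrame({
    "positive_ratings": [90, 5, 0, 10],
    "negative_ratings": [10, 0, 0, 40],
})

# Plain ratio: 5 / 0 gives inf and 0 / 0 gives NaN (pandas raises no error).
ratio = toy["positive_ratings"] / toy["negative_ratings"]
has_inf = bool(np.isinf(ratio).any())
has_nan = bool(ratio.isna().any())

# The notebook's rule: infinite ratios become 0; NaN entries are not touched.
cleaned = ratio.copy()
cleaned.loc[np.isinf(cleaned)] = 0

# Hypothetical alternative: share of positive ratings, bounded in [0, 1].
# Games with no ratings at all (0 / 0) are set to 0 here by assumption.
total = toy["positive_ratings"] + toy["negative_ratings"]
positive_share = (toy["positive_ratings"] / total).fillna(0.0)

print(cleaned)
print(positive_share)
```

Under the inf-to-zero rule, a game with only positive ratings gets the same value as a game with no positive ratings, which may matter when the feature is used as a distance input for clustering.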

In [7]:
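# Combine the selected numerical columns (plus `appid`) with the engineered `ratings_ratio`.
keep = pd.concat([
    kaggle_steam[['appid', 'required_age', 'achievements', 'positive_ratings', 'negative_ratings', 'average_playtime', 'median_playtime', 'price']],
    ratings_ratio.to_frame('ratings_ratio'),
], axis=1)

# Persist the result through the Kedro data catalog.
catalog.save('features_eng_2', keep)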