# Preprocessing

In this notebook I will split the data into training and testing datasets and standardize my features using a scaler. My dummy variables already exist: I created them in the data wrangling step so they would not affect my summary statistics in the EDA step.

### Imports

In [1]:
import pandas as pd
import numpy as np
import os
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

### Loading Data

In [2]:
game_data = pd.read_csv("data/game_data_cleaned.csv")

### Train/Test Split

I am going to split the data 70/30 into training and testing datasets. Because the number of suggestions is my target, I will set suggestions_count as my y variable and use every other column as a feature.

In [4]:
X_train, X_test, y_train, y_test = train_test_split(game_data.drop(columns='suggestions_count'), 
                                                    game_data.suggestions_count, test_size=0.3, 
                                                    random_state=47)

In [5]:
X_train.shape, X_test.shape

((241504, 73), (103502, 73))

In [6]:
y_train.shape, y_test.shape

((241504,), (103502,))
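The shapes above add up to 345,006 rows, with 103,502 of them (about 30%) in the test set. The same sanity checks can be done in code rather than by eye. The block below is a minimal sketch on a small made-up frame (`toy` and its values are hypothetical stand-ins, not my real data); it checks the test fraction, that X and y still share the same row index, and that no row appears in both sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for game_data (values are random, not real)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "released": rng.integers(1980, 2021, size=1000),
    "has_achievements": rng.integers(0, 2, size=1000).astype(bool),
    "suggestions_count": rng.integers(0, 500, size=1000),
})

# Same call pattern as the split above: drop the target from X, pass it as y
X_tr, X_te, y_tr, y_te = train_test_split(
    toy.drop(columns="suggestions_count"),
    toy["suggestions_count"],
    test_size=0.3,
    random_state=47,
)

# Fraction of rows that ended up in the test set
test_fraction = len(X_te) / len(toy)

# X and y should carry identical row labels after the split
indices_aligned = X_tr.index.equals(y_tr.index) and X_te.index.equals(y_te.index)

# No row label should appear in both the training and the testing set
no_overlap = X_tr.index.intersection(X_te.index).empty

print(test_fraction, indices_aligned, no_overlap)
```

Running the same three checks on my real X_train, X_test, y_train and y_test would only require swapping in those names.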

I will now check that every column in both X sets still has a numeric or boolean dtype, since the scaler cannot handle text columns.

In [8]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 241504 entries, 196476 to 313222
Data columns (total 73 columns):
 #   Column                       Non-Null Count   Dtype
---  ------                       --------------   -----
 0   released                     241504 non-null  int64
 1   has_achievements             241504 non-null  bool 
 2   in_series                    241504 non-null  bool 
 3   platform_3DO                 241504 non-null  int64
 4   platform_Android             241504 non-null  int64
 5   platform_Apple II            241504 non-null  int64
 6   platform_Atari 2600          241504 non-null  int64
 7   platform_Atari 5200          241504 non-null  int64
 8   platform_Atari 7800          241504 non-null  int64
 9   platform_Atari 8-bit         241504 non-null  int64
 10  platform_Atari Flashback     241504 non-null  int64
 11  platform_Atari Lynx          241504 non-null  int64
 12  platform_Atari ST            241504 non-null  int64
 13  platform_Atari XEGS     

In [9]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 103502 entries, 177663 to 64517
Data columns (total 73 columns):
 #   Column                       Non-Null Count   Dtype
---  ------                       --------------   -----
 0   released                     103502 non-null  int64
 1   has_achievements             103502 non-null  bool 
 2   in_series                    103502 non-null  bool 
 3   platform_3DO                 103502 non-null  int64
 4   platform_Android             103502 non-null  int64
 5   platform_Apple II            103502 non-null  int64
 6   platform_Atari 2600          103502 non-null  int64
 7   platform_Atari 5200          103502 non-null  int64
 8   platform_Atari 7800          103502 non-null  int64
 9   platform_Atari 8-bit         103502 non-null  int64
 10  platform_Atari Flashback     103502 non-null  int64
 11  platform_Atari Lynx          103502 non-null  int64
 12  platform_Atari ST            103502 non-null  int64
 13  platform_Atari XEGS      
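With 73 columns, reading the `info()` output line by line is easy to get wrong. A programmatic check is one alternative. The sketch below assumes a small helper of my own naming, `non_numeric_columns` (hypothetical, not part of pandas), and uses `DataFrame.select_dtypes` to list any column that is neither numeric nor boolean. The toy frames are made up for illustration.

```python
import pandas as pd

def non_numeric_columns(df):
    """Return the names of columns that are neither numeric nor boolean."""
    return df.select_dtypes(exclude=["number", "bool"]).columns.tolist()

# Hypothetical frame that mimics the dtypes shown above (int64 and bool only)
clean = pd.DataFrame({
    "released": [1999, 2005, 2012],
    "has_achievements": [True, False, True],
    "platform_Android": [0, 1, 0],
})

# Hypothetical frame with a text column that should be flagged
dirty = clean.assign(title=["a", "b", "c"])

print(non_numeric_columns(clean))
print(non_numeric_columns(dirty))
```

An empty list for both X_train and X_test would mean the sets are safe to pass to the scaler.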

### Scaling the Data

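I will now apply a standard scaler to put the features on a consistent scale. I fit the scaler on the training set only and then use it to transform both sets, so that no information from the test set leaks into the scaling.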

In [10]:
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
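One thing to keep in mind is that `StandardScaler.transform` returns a NumPy array, so the column names and row index are lost. The sketch below shows one possible way to keep them by wrapping the result back into a DataFrame. It runs on made-up data (`toy_train` and `toy_test` are hypothetical), and this wrapping step is an optional extra, not something the cell above does.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frames standing in for X_train and X_test
rng = np.random.default_rng(0)
toy_train = pd.DataFrame({
    "released": rng.integers(1980, 2021, size=200),
    "platform_Android": rng.integers(0, 2, size=200),
})
toy_test = pd.DataFrame({
    "released": rng.integers(1980, 2021, size=50),
    "platform_Android": rng.integers(0, 2, size=50),
})

# Learn the mean and standard deviation from the training data only
scaler = StandardScaler()
scaler.fit(toy_train)

# Transform both sets with the training statistics, keeping labels intact
train_scaled = pd.DataFrame(
    scaler.transform(toy_train), columns=toy_train.columns, index=toy_train.index
)
test_scaled = pd.DataFrame(
    scaler.transform(toy_test), columns=toy_test.columns, index=toy_test.index
)

# Training columns should have mean ~0 and (population) std ~1;
# test columns will be close to that but not exact
print(train_scaled.mean())
print(train_scaled.std(ddof=0))
```

The test set is not expected to come out with a mean of exactly 0, because it is scaled with the training set's statistics rather than its own.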

My data is now split into training and testing sets and scaled, so it is ready for fitting and evaluating models in the next step.