# VIDEO GAMES SALES PREDICTIONS (2024)

### Library

In [1]:
# Data Table 
import numpy as np                       # matrices & arrays
import pandas as pd                      # Data Table & dataframe 
from skimpy import skim                  # skim data
from prettytable import PrettyTable      # Create Tables

# Visualization 
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothesis Testing 
import scipy.stats as sps                 # statistical tests
from scipy.stats.mstats import winsorize  # Winsorizing
import statsmodels.api as sm              # regression
from statsmodels.formula.api import ols   # regression model
from scipy.stats import boxcox            # ideal way to transform skewed to normal

# Machine Learning 
import sklearn                            # scikit-learn package
from sklearn.preprocessing import PolynomialFeatures
                                          # Polynomial Features
from sklearn.decomposition import PCA     # PCA
from sklearn.feature_extraction import DictVectorizer
                                          # Categorical encoding
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder, LabelBinarizer, OneHotEncoder
                                          # Categorical Encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
                                          # Continuous Scalers 
from sklearn.model_selection import train_test_split
                                          # split train & test set
from sklearn.model_selection import KFold, cross_val_predict
                                          # Cross-validation 
from sklearn.pipeline import Pipeline
                                          # Pipeline the cross validation

from sklearn.model_selection import GridSearchCV
                                          # Grid Search CV 
from sklearn.metrics import r2_score, mean_squared_error
                                          # Model Evaluation Metrics

# Options
import warnings
warnings.filterwarnings('ignore')         # suppress all warnings; switch 'ignore' to 'default' to re-enable them
pd.set_option('display.max_rows', 500)    # display max rows 
pd.set_option('display.max_columns', 500) #         max cols
pd.set_option('display.width', 1000)      #         max width
pd.set_option('display.precision', 2)     #         round 2 places after decimal 

### Functions

##### nadup()

In [2]:
# Summarize unique values, NaN counts, and duplicated rows for each column
from IPython.display import display       # explicit import so display() also works outside a notebook

def nadup(df):
    uniques, n_unique, n_na, pct_na = [], [], [], []
    for col in df.columns:
        values = [str(x) for x in df[col].unique()]      # unique values as strings
        na_count = df[col].isna().sum()                  # number of NaN in this column
        uniques.append(', '.join(values))
        n_unique.append(len(values))
        n_na.append(na_count)
        pct_na.append(round(na_count / len(df) * 100, 1))  # % of NaN in this column
    # isnull().any(axis=1) flags rows, so this count is rows with at least one NA
    print('The dataframe has a total of %i rows & %i columns. A total of %i rows contain at least one NA value.\n'
          % (df.shape[0], df.shape[1], df.isnull().any(axis=1).sum()),
          ' This dataframe has', df.duplicated().sum(), 'duplicated rows')
    summary = pd.DataFrame({
        'Variables': df.columns,
        'Type': df.dtypes.to_list(),
        'Unique Values': uniques,
        'Number of Unique Values': n_unique,
        'Number of NaN Values': n_na,
        '% of NaN': pct_na}).sort_values('% of NaN', ascending=False)
    display(summary)

### About the dataset
This dataset was created in 2019 by user `SID_TWR` on [Kaggle](https://www.kaggle.com/), a machine learning and data science community. As of February 2nd, 2024, the dataset has 243,000 views and 46,300 downloads. 

According to the owner (i.e., user `SID_TWR`), this dataset was inspired by Gregory Smith's web scrape of the [VGChartz Video Games Sales dataset](https://www.kaggle.com/datasets/mathurtanvi/video-game-sales-dataset). That dataset contains the name, platform, year of release, genre, and publisher of each video game, along with its sales in North America, Europe, Japan, and other countries, and its total global sales. The dataset I use for this project extends those variables: the owner scraped [Metacritic](https://www.metacritic.com/) for the critic score, critic count, user score, user count, developer, and ESRB rating of each video game, in addition to the sales variables mentioned above. 

This dataset can be accessed and downloaded here: [Video Games Sales Dataset, Video Games Sales & Game Ratings Data Scraped from VzCharts](https://www.kaggle.com/datasets/sidtwr/videogames-sales-dataset/data?select=Video_Games_Sales_as_at_22_Dec_2016.csv)

In [12]:
# importing (the dataframe is named `sales` to match the cells that use it)
sales = pd.read_csv('sales.csv')

# rename 
sales.columns = ['game','platform','release_year',
                 'genre','publisher','na_sales','eu_sales',
                 'jp_sales','other_sales','global_sales',
                 'critic_scores','n_critic','user_scores',
                 'n_user','developer','esrb']

# Information
sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16719 entries, 0 to 16718
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   game           16717 non-null  object 
 1   platform       16719 non-null  object 
 2   release_year   16450 non-null  float64
 3   genre          16717 non-null  object 
 4   publisher      16665 non-null  object 
 5   na_sales       16719 non-null  float64
 6   eu_sales       16719 non-null  float64
 7   jp_sales       16719 non-null  float64
 8   other_sales    16719 non-null  float64
 9   global_sales   16719 non-null  float64
 10  critic_scores  8137 non-null   float64
 11  n_critic       8137 non-null   float64
 12  user_scores    7590 non-null   float64
 13  n_user         7590 non-null   float64
 14  developer      10096 non-null  object 
 15  esrb           9950 non-null   object 
dtypes: float64(10), object(6)
memory usage: 2.0+ MB


### Goals
For this project, I will try to predict the global sales of video games using multiple predictors. This project will place a heavier emphasis on prediction and less on explanation. 
1. **Descriptive.** 
- Visualize some of the top 10 or top 25 video games, genres, or publishers with the highest sales. 
- Identify the association between global sales and various predictors. 

2. **Predictive.** 
- Construct two algorithms to predict global sales from various predictors, one with feature engineering and one without. 
- Utilize different metrics to identify the best algorithm. 

### Questions 
##### Descriptive questions. 
1. What are the top 25 games with the highest sales? 
2. Do critics and users rate video games differently? 
3. What are the top 10 most successful publishers? 
4. Which video game genres have the highest global sales? 
5. Do video games get better over the years? 
6. What is the association between global sales of video games and each predictor? 

##### Predictive Questions. 
1. ***Can we predict global sales of video games from the following predictors:***
    - Platform
    - Release year
    - Genre
    - Publisher
    - Critic Scores 
    - Number of critics 
    - User Scores
    - Number of Users 
    - Developer
    - ESRB rating 
2. ***Can we predict global sales of video games with feature engineering?***
    - Each individual feature
    - Platform * Genre 
    - Publisher * Genre
    - Release Year * Genre 
    - Platform * Publisher * Release Year 
    - Platform * Publisher * Release Year * Genre 

# Methods

### Analysis Plan
1. **Data Cleaning.** How the data will be processed and cleaned (i.e., duplicates, NA values, outliers). 
- `duplicates`. All duplicate rows will be removed. 
- `NA values.` 
    1. For each variable with NA values, a separate binary column will be created, with 1 = value is missing and 0 = value is present (e.g., the `n_critic` column will have `n_critic_na`). 
    2. Estimation. For each column with NA values, an estimated column will be created in which the NA values are replaced with the sample mean or median (e.g., the `n_critic` column will also have `n_critic_est`). 
    3. A chi-squared test (for categorical variables) and a Shapiro-Wilk test (for continuous variables) will be conducted on the original column and the estimated column to see whether their distributions are significantly different. 
2. Exploratory Data Analysis (EDA). Several exploratory analyses will be performed to answer the questions established above. 
    - Q1. A bar graph illustrating the top 25 games with the highest global sales.
    - Q2. A t-test will be conducted to compare rating scores between critics and users. A bar graph will be provided to visualize this comparison. 
    - Q3. A bar graph illustrating the top 10 publishers with the highest global sales. 
    - Q4. A bar graph illustrating the top 10 genres with the highest global sales. 
    - Q5. Do video games get better over the years? 
    - Q6. What is the association between global sales of video games and each predictor? 
        - What are the relationships between the predictors and the outcome?
        - Do we need a pairplot? 
        - What are the correlations between all predictors and the outcome?
        - If any, which hypothesis tests should we use? 
3. Pre-processing steps
    - Do we need to perform variable transformation? (for example, using years since the first release year instead of the release year)
    - Do we need to perform feature engineering? If so, why?
    - Will this be a classification or regression model? 
4. Data split 
    - What percentage of the data will be test data? 
    - Do we perform cross-validation? 
5. Building the model
    - What models will we use? (linear regression, ridge, lasso, etc.)
    - What evaluation metrics will we use? (R2, confusion matrix, etc.)
    - Comparison table
    - Plot of all predictions 
6. Conclusion.

### Variables of interest
|Variables|Class|Type|Definition, Unit, Scale|Interpretation|
|---------|-----|----|----------|--------------|
|GDP|continuous|outcome|The amount of money a country makes in a year ($)|Higher GDP = richer country|
|Age|continuous|predictor|A person's age, in years|Higher age = older|
|Sex|categorical|predictor|A person's biological sex|M = male, F = female|


# DATA CLEANING 

### Duplicates

In [None]:
nadup(sales)
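
The analysis plan states that all duplicate rows will be removed. A minimal sketch of that step on a small made-up frame (the `toy` data and the `deduped` name are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy rows; the second row repeats the first exactly.
toy = pd.DataFrame({
    "game": ["A", "A", "B"],
    "platform": ["Wii", "Wii", "PS4"],
    "global_sales": [1.5, 1.5, 0.3],
})

# Count rows that repeat an earlier row, then keep only the first occurrence.
n_dupes = int(toy.duplicated().sum())
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes, len(deduped))
```

The same two calls should carry over to `sales`, assuming a duplicate means a row that matches another row in every column.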

### NA values
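
The plan describes a missing-value indicator column and an estimated column for each variable with NA values. The sketch below shows one possible way to build both for `n_critic`, using made-up values. It assumes median imputation (the plan allows the mean or the median). For the distribution check it swaps in a two-sample Kolmogorov-Smirnov test (`scipy.stats.ks_2samp`), because the Shapiro-Wilk test assesses the normality of a single sample rather than comparing two samples.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

# Hypothetical toy values standing in for the real n_critic column.
toy = pd.DataFrame({"n_critic": [10.0, np.nan, 30.0, 20.0, np.nan, 40.0]})

# Indicator column: 1 = value was missing, 0 = value was observed.
toy["n_critic_na"] = toy["n_critic"].isna().astype(int)

# Estimated column: fill missing values with the median of the observed values.
median = toy["n_critic"].median()
toy["n_critic_est"] = toy["n_critic"].fillna(median)

# Assumption: compare observed values against the estimated column with a
# two-sample Kolmogorov-Smirnov test (swapped in for Shapiro-Wilk).
stat, p_value = ks_2samp(toy["n_critic"].dropna(), toy["n_critic_est"])
```

A small `p_value` would suggest that imputation changed the shape of the distribution noticeably.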

# Exploratory Data Analysis (EDA)
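
For Q2, the plan calls for a t-test comparing critic and user ratings. The sketch below is one way this could look on made-up scores. It rests on two assumptions that are not stated in the plan: critic scores are on a 0-100 scale and user scores on a 0-10 scale (so user scores are multiplied by 10), and a paired t-test (`scipy.stats.ttest_rel`) is used because both scores describe the same game.

```python
import pandas as pd
from scipy.stats import ttest_rel

# Hypothetical toy scores; the real columns are critic_scores and user_scores.
toy = pd.DataFrame({
    "critic_scores": [80.0, 65.0, 90.0, 70.0, 55.0],
    "user_scores": [7.5, 7.0, 8.0, 7.4, 6.5],
})

# Assumption: put user scores on the same 0-100 scale as critic scores.
toy["user_scores_100"] = toy["user_scores"] * 10

# Keep only games rated by both groups, then run a paired t-test.
both = toy.dropna(subset=["critic_scores", "user_scores_100"])
t_stat, p_value = ttest_rel(both["critic_scores"], both["user_scores_100"])
mean_diff = (both["critic_scores"] - both["user_scores_100"]).mean()
```

An independent-samples test (`scipy.stats.ttest_ind`) would be the alternative if the two sets of ratings were treated as unrelated groups.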

### Pairplot 

# Pre-processing 

### Standardization 

### Outliers

### Features Engineering 
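
The second predictive question lists interaction features such as Platform * Genre. One simple way to represent an interaction between two categorical variables is to combine them into a single category and one-hot encode the result. The sketch below uses made-up rows; the column name `platform_x_genre` and the `pg` prefix are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows with the two categorical predictors.
toy = pd.DataFrame({
    "platform": ["Wii", "PS4", "Wii"],
    "genre": ["Sports", "Action", "Action"],
})

# Platform * Genre as one combined category per row.
toy["platform_x_genre"] = toy["platform"] + "_" + toy["genre"]

# One-hot encode the combined category so a regression model can use it.
encoded = pd.get_dummies(toy["platform_x_genre"], prefix="pg")
```

Higher-order interactions (for example Platform * Publisher * Release Year) could be built the same way, although the number of categories grows quickly.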

# Data Splits

### Train-Test Split
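
The plan leaves the test-set percentage open. As a placeholder, the sketch below holds out 20% of a made-up feature matrix with `train_test_split`. Both the 20% figure and `random_state=42` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 2 features, 1 target.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Assumption: hold out 20% of the rows as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```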

### Cross-validation 
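
The imports at the top of the notebook include `KFold`, `cross_val_predict`, and `Pipeline`. The sketch below shows how these pieces could fit together on synthetic data. The 5 folds, the `StandardScaler` step, and the linear regression model are assumptions, not choices made in the plan.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: a linear signal plus a small amount of noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Scaling lives inside the pipeline so it is refit on each training fold.
model = Pipeline([("scale", StandardScaler()), ("reg", LinearRegression())])
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Each prediction comes from a model that did not see that row during fitting.
y_pred = cross_val_predict(model, X, y, cv=cv)
r2 = r2_score(y, y_pred)
rmse = np.sqrt(mean_squared_error(y, y_pred))
```

Keeping the scaler inside the pipeline avoids leaking information from the validation fold into the training fold.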

# Supervised ML Model - Regression

### Model 1

### Model 2

### Model 3

### Evaluation Metric

##### Comparison Table
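
The plan mentions linear regression, ridge, and lasso as candidate models and R2 as one evaluation metric. The sketch below builds a comparison table for those three models on synthetic data. The `alpha` values, the 20% test split, and the added RMSE column are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the prepared features and global sales.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = X @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(scale=0.2, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Hypothetical model settings; the alpha values are not tuned.
models = {
    "Linear": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.01),
}

rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rows.append({
        "Model": name,
        "R2": r2_score(y_test, pred),
        "RMSE": np.sqrt(mean_squared_error(y_test, pred)),
    })

# One row per model, best R2 first.
comparison = (
    pd.DataFrame(rows).sort_values("R2", ascending=False).reset_index(drop=True)
)
print(comparison)
```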

##### Plot of all Prediction

# CONCLUSION

# References

### Minh K. Chau