# Do Different Factors Affect Video Game Sales in Different Regions?

## Thesis Statement

The goal of this project is to investigate whether specific features of a video game (its publisher, platform, genre, critic score, and user score) are associated with its sales performance, and whether those associations differ by region. I hypothesize that the North American, European, and Japanese markets weight these factors differently. For example, Nintendo-published titles may perform better in Japan, while highly rated games may drive more sales in North America.

## Plan for Analysis

To test this hypothesis, I will:
- Analyze the correlation between various game attributes and regional sales (`NA_Sales`, `EU_Sales`, `JP_Sales`).
- Fit regression models to quantify how strongly each feature is associated with sales.
- Compare and interpret the results across regions to identify patterns or regional preferences.

The outcome of this analysis will show whether each attribute has a consistent or a region-specific relationship with sales, and how much of a game's commercial success the attributes recorded in the dataset can explain on their own.
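
As a rough illustration of the first step in the plan, the sketch below computes the Pearson correlation between the two score columns and each regional sales column. The helper name `regional_correlations` and the small `toy` frame are hypothetical and exist only to show the call. The numbers are made up and are not results from the dataset.

```python
import pandas as pd

REGION_COLS = ["NA_Sales", "EU_Sales", "JP_Sales"]
SCORE_COLS = ["Critic_Score", "User_Score"]


def regional_correlations(frame: pd.DataFrame) -> pd.DataFrame:
    """Pearson correlation of each score column with each regional sales column."""
    corr = frame[SCORE_COLS + REGION_COLS].corr(method="pearson")
    # Keep only the score-vs-region block of the full correlation matrix
    return corr.loc[SCORE_COLS, REGION_COLS]


# Tiny made-up frame purely to show the call; not real sales figures
toy = pd.DataFrame({
    "Critic_Score": [60.0, 70.0, 80.0, 90.0],
    "User_Score": [9.0, 8.0, 7.0, 6.0],
    "NA_Sales": [0.1, 0.2, 0.3, 0.4],
    "EU_Sales": [0.4, 0.3, 0.2, 0.1],
    "JP_Sales": [0.0, 0.1, 0.0, 0.1],
})
table = regional_correlations(toy)
print(table)
```

Reading the resulting table row by row gives a first look at whether a score tracks sales differently across regions. Pearson correlation only captures linear relationships, so it is a starting point rather than a full answer.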

## Data Source

The data used in this project comes from the Kaggle dataset:
- URL: https://www.kaggle.com/datasets/rush4ratio/video-game-sales-with-ratings
- Citation: Rush Kirubi. (2017). *Video Game Sales with Ratings*. Kaggle.

## AI Usage Statement

ChatGPT provided some brainstorming assistance in refining the thesis statement, analysis plan, and project organization. The final writing, analysis, and conclusions are solely my own work.


In [63]:
import pandas as pd

df = pd.read_csv('Video_Games_Sales_as_at_22_Dec_2016.csv')
df.head()

,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating
0,Wii Sports,Wii,2006.0,Sports,Nintendo,41.36,28.96,3.77,8.45,82.53,76.0,51.0,8.0,322.0,Nintendo,E
1,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24,,,,,,
2,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.68,12.76,3.79,3.29,35.52,82.0,73.0,8.3,709.0,Nintendo,E
3,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.61,10.93,3.28,2.95,32.77,80.0,73.0,8.0,192.0,Nintendo,E
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37,,,,,,


In [64]:
# Drop rows where Critic_Score or User_Score is missing in the raw data
df_mod = df.dropna(subset=['Critic_Score', 'User_Score']).copy()

# Fill missing for categorical columns
for col in ['Publisher', 'Developer', 'Rating']:
    df_mod[col] = df_mod[col].fillna('Unknown')

# Fill missing years with a -1 sentinel (note: this skews summary statistics such as the mean)
df_mod['Year_of_Release'] = df_mod['Year_of_Release'].fillna(-1)

# Convert User_Score to numeric; entries that cannot be parsed become NaN
df_mod['User_Score'] = pd.to_numeric(df_mod['User_Score'], errors='coerce')

# Drop rows where User_Score is NaN AFTER conversion (important!)
df_mod = df_mod.dropna(subset=['User_Score'])

# Drop the review-count columns, which are not used as features
df_mod = df_mod.drop(['Critic_Count', 'User_Count'], axis=1)

# Final missing check
print(df_mod.isnull().sum())

Name               0
Platform           0
Year_of_Release    0
Genre              0
Publisher          0
NA_Sales           0
EU_Sales           0
JP_Sales           0
Other_Sales        0
Global_Sales       0
Critic_Score       0
User_Score         0
Developer          0
Rating             0
dtype: int64
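
The second `dropna` in the cell above matters because `errors='coerce'` turns any unparseable entry into `NaN`, so rows that survived the first drop can become missing after conversion. A minimal sketch of that behaviour follows, assuming the raw column holds non-numeric placeholder strings. The `"tbd"` value here is an assumed example, not something taken from the output above.

```python
import pandas as pd

# Hypothetical raw values; "tbd" stands in for any non-numeric placeholder
raw_scores = pd.Series(["8.0", "tbd", "7.5", None])

# Unparseable strings and missing values both end up as NaN
numeric_scores = pd.to_numeric(raw_scores, errors="coerce")

print(numeric_scores.dtype)
print(numeric_scores.isna().sum())
```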


In [65]:
print(df_mod.describe())


       Year_of_Release     NA_Sales     EU_Sales     JP_Sales  Other_Sales  \
count      7017.000000  7017.000000  7017.000000  7017.000000  7017.000000   
mean       1972.275901     0.389290     0.233095     0.062951     0.081525   
std         263.627539     0.957051     0.679210     0.284162     0.266594   
min          -1.000000     0.000000     0.000000     0.000000     0.000000   
25%        2004.000000     0.060000     0.020000     0.000000     0.010000   
50%        2007.000000     0.150000     0.060000     0.000000     0.020000   
75%        2011.000000     0.390000     0.210000     0.010000     0.070000   
max        2016.000000    41.360000    28.960000     6.500000    10.570000   

       Global_Sales  Critic_Score   User_Score  
count   7017.000000   7017.000000  7017.000000  
mean       0.767049     70.249822     7.182428  
std        1.940317     13.880646     1.441241  
min        0.010000     13.000000     0.500000  
25%        0.110000     62.000000     6.500000  
50%

In [66]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Setup targets
y_na = df_mod['NA_Sales']
y_eu = df_mod['EU_Sales']
y_jp = df_mod['JP_Sales']
y_oth = df_mod['Other_Sales']
y_global = df_mod['Global_Sales']

# Set categorical features
cat_features = ['Platform', 'Genre', 'Publisher', 'Rating']

# OneHotEncoder with dense output (the sparse_output argument requires scikit-learn >= 1.2)
oneHotEnc = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

# ColumnTransformer
coltrans = ColumnTransformer(
    transformers=[
        ("onehot", oneHotEnc, cat_features)
    ],
    remainder='passthrough',
    verbose_feature_names_out=False
)

# Drop the sales targets plus columns not used as features (Name, Developer, Year_of_Release)
X_features = df_mod.drop(
    ['Name', 'NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales', 'Developer', 'Year_of_Release'],
    axis=1
)

# Fit the encoder and transform the features into a numeric matrix
X_trans = coltrans.fit_transform(X_features)

print(X_trans.shape)


(7017, 311)


In [67]:
from sklearn.model_selection import train_test_split

# Split features and target (for NA_Sales first)
X_train_na, X_test_na, y_train_na, y_test_na = train_test_split(X_trans, y_na, test_size=0.2, random_state=42)


In [68]:
import numpy as np

# Check for any NaNs in X_trans
print(np.isnan(X_trans).sum())

0
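
With the feature matrix free of NaNs, the regression step from the plan could be sketched as follows. This is only one possible approach, not the method the notebook settles on: the helper `fit_per_region`, the choice of plain `LinearRegression`, and the synthetic demo data are all assumptions made so the block runs on its own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def fit_per_region(X, targets, test_size=0.2, random_state=42):
    """Fit one linear model per target and return the held-out R^2 for each."""
    scores = {}
    for name, y in targets.items():
        # Same random_state for every target, so all regions share one split
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=random_state
        )
        model = LinearRegression().fit(X_tr, y_tr)
        scores[name] = r2_score(y_te, model.predict(X_te))
    return scores


# Synthetic stand-in data (not real sales) so the sketch is self-contained
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
targets_demo = {
    "NA_Sales": X_demo @ np.array([1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200),
    "JP_Sales": rng.normal(size=200),  # deliberately unrelated to the features
}
demo_scores = fit_per_region(X_demo, targets_demo)
print(demo_scores)
```

In the notebook, the call would look something like `fit_per_region(X_trans, {"NA_Sales": y_na, "EU_Sales": y_eu, "JP_Sales": y_jp})`. With several hundred one-hot columns, ordinary least squares may overfit, so a regularized model such as `Ridge` is worth considering as an alternative.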
