# Random Forest Model: Total_Sales < 1 M

## Predicting the low-sales class from game features
Most games in the dataset have total sales below 1 million and are categorized as low total sales. These games are extracted as the input data for this model, which predicts whether a game's low-sales class is '0' (total sales < 0.2 M) or '1' (0.2 M ≤ total sales < 1 M). The model is also used to determine whether the feature importances differ from those of the first model, whose input data includes all total sales, both low and high (0.01 to 82.86 million).

### Target Variable and Features
- Target variable (y) = Low_Sales_Class
- Features (X) = Genre, ESRB_Rating, Platform, Publisher, Developer_x, Critic_Score, Year

### Machine Learning Models
- rf_model = RandomForestClassifier
- brf_model = BalancedRandomForestClassifier
- eec_model = EasyEnsembleClassifier


In [12]:
import pandas as pd
import numpy as np

from pathlib import Path
from collections import Counter

from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import confusion_matrix
from imblearn.metrics import classification_report_imbalanced

In [47]:
# Load the dataset from AWS S3 bucket
#games_df = pd.read_csv('https://video-game-dataset-uot-boot-camp-2022-group-4.s3.us-east-2.amazonaws.com/all_columns_df.csv')
games_df = pd.read_csv('Cleaned_Data/all_columns_df.csv')
games_df

Unnamed: 0,Rank,Name,Genre,ESRB_Rating,Platform,Publisher,Developer_x,Critic_Score,User_Score,Year,Country,Total_Sales
0,1,Wii Sports,Sports,E,Wii,Nintendo,Nintendo EAD,7.7,,2006.0,Japan,82.86
1,2,Super Mario Bros.,Platform,,NES,Nintendo,Nintendo EAD,10.0,,1985.0,Japan,40.24
2,3,Mario Kart Wii,Racing,E,Wii,Nintendo,Nintendo EAD,8.2,9.1,2008.0,Japan,37.14
3,4,PlayerUnknown's Battlegrounds,Shooter,,PC,PUBG Corporation,PUBG Corporation,,,2017.0,,36.60
4,5,Wii Sports Resort,Sports,E,Wii,Nintendo,Nintendo EAD,8.0,8.8,2009.0,Japan,33.09
...,...,...,...,...,...,...,...,...,...,...,...,...
19857,19858,FirePower for Microsoft Combat Flight Simulator 3,Simulation,T,PC,GMX Media,Shockwave Productions,,,2004.0,,0.01
19858,19859,Tom Clancy's Splinter Cell,Shooter,T,PC,Ubisoft,Ubisoft,,,2003.0,Europe,0.01
19859,19860,Ashita no Joe 2: The Anime Super Remix,Fighting,,PS2,Capcom,Capcom,,,2002.0,Japan,0.01
19860,19861,Tokyo Yamanote Boys for V: Main Disc,Adventure,,PSV,Rejet,Rejet,,,2017.0,,0.01


In [48]:
games_df.dtypes

Rank              int64
Name             object
Genre            object
ESRB_Rating      object
Platform         object
Publisher        object
Developer_x      object
Critic_Score    float64
User_Score      float64
Year            float64
Country          object
Total_Sales     float64
dtype: object

In [49]:
# Convert 'Year' from float to a four-digit string (object dtype)
games_df['Year'] = pd.to_datetime(games_df['Year'], format = '%Y').dt.strftime('%Y')
games_df.dtypes

Rank              int64
Name             object
Genre            object
ESRB_Rating      object
Platform         object
Publisher        object
Developer_x      object
Critic_Score    float64
User_Score      float64
Year             object
Country          object
Total_Sales     float64
dtype: object

In [50]:
# Drop columns that won't be included in the analysis
games_df.drop(['Rank'], axis=1, inplace=True)
games_df

Unnamed: 0,Name,Genre,ESRB_Rating,Platform,Publisher,Developer_x,Critic_Score,User_Score,Year,Country,Total_Sales
0,Wii Sports,Sports,E,Wii,Nintendo,Nintendo EAD,7.7,,2006,Japan,82.86
1,Super Mario Bros.,Platform,,NES,Nintendo,Nintendo EAD,10.0,,1985,Japan,40.24
2,Mario Kart Wii,Racing,E,Wii,Nintendo,Nintendo EAD,8.2,9.1,2008,Japan,37.14
3,PlayerUnknown's Battlegrounds,Shooter,,PC,PUBG Corporation,PUBG Corporation,,,2017,,36.60
4,Wii Sports Resort,Sports,E,Wii,Nintendo,Nintendo EAD,8.0,8.8,2009,Japan,33.09
...,...,...,...,...,...,...,...,...,...,...,...
19857,FirePower for Microsoft Combat Flight Simulator 3,Simulation,T,PC,GMX Media,Shockwave Productions,,,2004,,0.01
19858,Tom Clancy's Splinter Cell,Shooter,T,PC,Ubisoft,Ubisoft,,,2003,Europe,0.01
19859,Ashita no Joe 2: The Anime Super Remix,Fighting,,PS2,Capcom,Capcom,,,2002,Japan,0.01
19860,Tokyo Yamanote Boys for V: Main Disc,Adventure,,PSV,Rejet,Rejet,,,2017,,0.01


In [51]:
# Summary statistics of the numeric columns (including Total_Sales)
games_df.describe()

Unnamed: 0,Critic_Score,User_Score,Total_Sales
count,4706.0,238.0,19862.0
mean,7.269911,8.465546,0.530876
std,1.420956,1.215681,1.572634
min,1.0,2.0,0.01
25%,6.5,8.0,0.05
50%,7.5,8.8,0.16
75%,8.3,9.3,0.45
max,10.0,10.0,82.86


## Bin Total_Sales and Create 'Total_Sales_Class' column

In [52]:
# Create bin for 'Total_Sales' column
bins = [0,1,10,100]
labels = ['low', 'medium', 'high']

In [53]:
# Bin 'Total_Sales' into new column
games_df['Total_Sales_Class'] = pd.cut(games_df['Total_Sales'], bins=bins, labels=labels, right=False)
games_df

Unnamed: 0,Name,Genre,ESRB_Rating,Platform,Publisher,Developer_x,Critic_Score,User_Score,Year,Country,Total_Sales,Total_Sales_Class
0,Wii Sports,Sports,E,Wii,Nintendo,Nintendo EAD,7.7,,2006,Japan,82.86,high
1,Super Mario Bros.,Platform,,NES,Nintendo,Nintendo EAD,10.0,,1985,Japan,40.24,high
2,Mario Kart Wii,Racing,E,Wii,Nintendo,Nintendo EAD,8.2,9.1,2008,Japan,37.14,high
3,PlayerUnknown's Battlegrounds,Shooter,,PC,PUBG Corporation,PUBG Corporation,,,2017,,36.60,high
4,Wii Sports Resort,Sports,E,Wii,Nintendo,Nintendo EAD,8.0,8.8,2009,Japan,33.09,high
...,...,...,...,...,...,...,...,...,...,...,...,...
19857,FirePower for Microsoft Combat Flight Simulator 3,Simulation,T,PC,GMX Media,Shockwave Productions,,,2004,,0.01,low
19858,Tom Clancy's Splinter Cell,Shooter,T,PC,Ubisoft,Ubisoft,,,2003,Europe,0.01,low
19859,Ashita no Joe 2: The Anime Super Remix,Fighting,,PS2,Capcom,Capcom,,,2002,Japan,0.01,low
19860,Tokyo Yamanote Boys for V: Main Disc,Adventure,,PSV,Rejet,Rejet,,,2017,,0.01,low


In [54]:
games_df.Total_Sales_Class.value_counts()

low       17420
medium     2355
high         87
Name: Total_Sales_Class, dtype: int64
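
Because `right=False` is passed to `pd.cut`, each bin is closed on the left and open on the right, so a game with exactly 1.00 M in sales falls in 'medium', not 'low'. A minimal sketch of that edge behavior, using made-up sample values rather than rows from the dataset:

```python
import pandas as pd

# Hypothetical values chosen to sit on and around the bin edges.
sales = pd.Series([0.01, 0.99, 1.0, 9.99, 10.0, 82.86])

bins = [0, 1, 10, 100]
labels = ['low', 'medium', 'high']

# right=False gives the intervals [0, 1), [1, 10) and [10, 100).
classes = pd.cut(sales, bins=bins, labels=labels, right=False)
print(classes.tolist())
```

With the default `right=True`, the same edges would instead produce (0, 1], (1, 10] and (10, 100], moving the boundary values into the lower bin.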

## Look at the low total_sales (Total_Sales < 1M)

In [55]:
low_sales_df = games_df.loc[games_df['Total_Sales_Class'] == 'low']
low_sales_df

Unnamed: 0,Name,Genre,ESRB_Rating,Platform,Publisher,Developer_x,Critic_Score,User_Score,Year,Country,Total_Sales,Total_Sales_Class
2442,NFL Fever 2002,Sports,E,XB,Microsoft,Microsoft,,,2001,United States,0.99,low
2443,Haze,Shooter,M,PS3,Ubisoft,Free Radical Design,5.6,,2008,Europe,0.99,low
2444,The Simpsons: Hit & Run,Racing,T,GC,VU Games,Radical Entertainment,8.2,,2003,,0.99,low
2445,Oddworld: Abe's Exoddus,Platform,T,PS,GT Interactive,Oddworld Inhabitans,8.6,,1998,,0.99,low
2446,Tales of Graces f,Role-Playing,T,PS3,Namco Bandai,Namco Tales Studio,,8.0,2012,Japan,0.99,low
...,...,...,...,...,...,...,...,...,...,...,...,...
19857,FirePower for Microsoft Combat Flight Simulator 3,Simulation,T,PC,GMX Media,Shockwave Productions,,,2004,,0.01,low
19858,Tom Clancy's Splinter Cell,Shooter,T,PC,Ubisoft,Ubisoft,,,2003,Europe,0.01,low
19859,Ashita no Joe 2: The Anime Super Remix,Fighting,,PS2,Capcom,Capcom,,,2002,Japan,0.01,low
19860,Tokyo Yamanote Boys for V: Main Disc,Adventure,,PSV,Rejet,Rejet,,,2017,,0.01,low


In [56]:
low_sales_df.count()

Name                 17420
Genre                17420
ESRB_Rating          11820
Platform             17420
Publisher            17420
Developer_x          17418
Critic_Score          3405
User_Score             107
Year                 17417
Country               9868
Total_Sales          17420
Total_Sales_Class    17420
dtype: int64

In [57]:
low_sales_df.describe()

Unnamed: 0,Critic_Score,User_Score,Total_Sales
count,3405.0,107.0,17420.0
mean,6.956035,8.060748,0.210579
std,1.409481,1.331247,0.221424
min,1.0,2.0,0.01
25%,6.2,7.6,0.05
50%,7.2,8.2,0.13
75%,8.0,9.0,0.3
max,9.7,10.0,0.99


In [58]:
low_sales_df.drop(['Total_Sales_Class'], axis=1, inplace=True)
low_sales_df

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  low_sales_df.drop(['Total_Sales_Class'], axis=1, inplace=True)


Unnamed: 0,Name,Genre,ESRB_Rating,Platform,Publisher,Developer_x,Critic_Score,User_Score,Year,Country,Total_Sales
2442,NFL Fever 2002,Sports,E,XB,Microsoft,Microsoft,,,2001,United States,0.99
2443,Haze,Shooter,M,PS3,Ubisoft,Free Radical Design,5.6,,2008,Europe,0.99
2444,The Simpsons: Hit & Run,Racing,T,GC,VU Games,Radical Entertainment,8.2,,2003,,0.99
2445,Oddworld: Abe's Exoddus,Platform,T,PS,GT Interactive,Oddworld Inhabitans,8.6,,1998,,0.99
2446,Tales of Graces f,Role-Playing,T,PS3,Namco Bandai,Namco Tales Studio,,8.0,2012,Japan,0.99
...,...,...,...,...,...,...,...,...,...,...,...
19857,FirePower for Microsoft Combat Flight Simulator 3,Simulation,T,PC,GMX Media,Shockwave Productions,,,2004,,0.01
19858,Tom Clancy's Splinter Cell,Shooter,T,PC,Ubisoft,Ubisoft,,,2003,Europe,0.01
19859,Ashita no Joe 2: The Anime Super Remix,Fighting,,PS2,Capcom,Capcom,,,2002,Japan,0.01
19860,Tokyo Yamanote Boys for V: Main Disc,Adventure,,PSV,Rejet,Rejet,,,2017,,0.01
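
The `SettingWithCopyWarning` above appears because `low_sales_df` was created by filtering `games_df` and was then modified in place. One common way to avoid it, sketched here on a small hypothetical frame, is to take an explicit `.copy()` of the filtered rows and to assign the result of `drop` rather than using `inplace=True`:

```python
import pandas as pd

# Small illustrative frame; the rows are made up, not taken from the dataset.
games = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'Total_Sales': [0.5, 2.0, 0.1],
    'Total_Sales_Class': ['low', 'medium', 'low'],
})

# .copy() makes the filtered subset an independent DataFrame.
low = games.loc[games['Total_Sales_Class'] == 'low'].copy()

# Reassign instead of dropping in place.
low = low.drop(columns=['Total_Sales_Class'])

# Adding a column to the copy no longer triggers the warning.
low['Low_Sales_Class'] = pd.cut(
    low['Total_Sales'], bins=[0, 0.2, 1], labels=['0', '1'], right=False
)
print(low)
```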


In [65]:
# Create bin for 'Total_Sales' column
bins = [0,0.2,1]
labels = ['0', '1']

In [66]:
# Bin 'Total_Sales' into new column
low_sales_df['Low_Sales_Class'] = pd.cut(low_sales_df['Total_Sales'], bins=bins, labels=labels, right=False)
low_sales_df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  low_sales_df['Low_Sales_Class'] = pd.cut(low_sales_df['Total_Sales'], bins=bins, labels=labels, right=False)


Unnamed: 0,Name,Genre,ESRB_Rating,Platform,Publisher,Developer_x,Critic_Score,User_Score,Year,Country,Total_Sales,Low_Sales_Class
2442,NFL Fever 2002,Sports,E,XB,Microsoft,Microsoft,,,2001,United States,0.99,1
2443,Haze,Shooter,M,PS3,Ubisoft,Free Radical Design,5.6,,2008,Europe,0.99,1
2444,The Simpsons: Hit & Run,Racing,T,GC,VU Games,Radical Entertainment,8.2,,2003,,0.99,1
2445,Oddworld: Abe's Exoddus,Platform,T,PS,GT Interactive,Oddworld Inhabitans,8.6,,1998,,0.99,1
2446,Tales of Graces f,Role-Playing,T,PS3,Namco Bandai,Namco Tales Studio,,8.0,2012,Japan,0.99,1
...,...,...,...,...,...,...,...,...,...,...,...,...
19857,FirePower for Microsoft Combat Flight Simulator 3,Simulation,T,PC,GMX Media,Shockwave Productions,,,2004,,0.01,0
19858,Tom Clancy's Splinter Cell,Shooter,T,PC,Ubisoft,Ubisoft,,,2003,Europe,0.01,0
19859,Ashita no Joe 2: The Anime Super Remix,Fighting,,PS2,Capcom,Capcom,,,2002,Japan,0.01,0
19860,Tokyo Yamanote Boys for V: Main Disc,Adventure,,PSV,Rejet,Rejet,,,2017,,0.01,0


In [67]:
low_sales_df.Low_Sales_Class.value_counts()

0    10931
1     6489
Name: Low_Sales_Class, dtype: int64

## Dropping NaNs

In [69]:
# Drop unnecessary columns
low_sales_df.drop(['Name','User_Score', 'Total_Sales', 'Country'], axis=1, inplace=True)
low_sales_df

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  low_sales_df.drop(['Name','User_Score', 'Total_Sales', 'Country'], axis=1, inplace=True)


Unnamed: 0,Genre,ESRB_Rating,Platform,Publisher,Developer_x,Critic_Score,Year,Low_Sales_Class
2442,Sports,E,XB,Microsoft,Microsoft,,2001,1
2443,Shooter,M,PS3,Ubisoft,Free Radical Design,5.6,2008,1
2444,Racing,T,GC,VU Games,Radical Entertainment,8.2,2003,1
2445,Platform,T,PS,GT Interactive,Oddworld Inhabitans,8.6,1998,1
2446,Role-Playing,T,PS3,Namco Bandai,Namco Tales Studio,,2012,1
...,...,...,...,...,...,...,...,...
19857,Simulation,T,PC,GMX Media,Shockwave Productions,,2004,0
19858,Shooter,T,PC,Ubisoft,Ubisoft,,2003,0
19859,Fighting,,PS2,Capcom,Capcom,,2002,0
19860,Adventure,,PSV,Rejet,Rejet,,2017,0


In [70]:
low_sales_df.dropna().count()

Genre              3294
ESRB_Rating        3294
Platform           3294
Publisher          3294
Developer_x        3294
Critic_Score       3294
Year               3294
Low_Sales_Class    3294
dtype: int64

In [71]:
# Drop all rows containing NaN values
low_sales_df = low_sales_df.dropna()
print(low_sales_df.shape)

(3294, 8)


In [72]:
# Check unique values
low_sales_df.nunique()

Genre               19
ESRB_Rating          5
Platform            24
Publisher          206
Developer_x        965
Critic_Score        83
Year                26
Low_Sales_Class      2
dtype: int64

## Bucket data into top 15 and 'other' bins

In [73]:
# Keep top 15 of Genre
top = low_sales_df.Genre.value_counts().index[0:15]
low_sales_df.Genre = np.where(low_sales_df.Genre.isin(top), low_sales_df.Genre,'other')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  low_sales_df.Genre = np.where(low_sales_df.Genre.isin(top), low_sales_df.Genre,'other')


In [74]:
# Keep top 15 of Platform
top = low_sales_df.Platform.value_counts().index[0:15]
low_sales_df.Platform = np.where(low_sales_df.Platform.isin(top), low_sales_df.Platform,'other')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  low_sales_df.Platform = np.where(low_sales_df.Platform.isin(top), low_sales_df.Platform,'other')


In [75]:
# Keep top 15 of Publisher
top = low_sales_df.Publisher.value_counts().index[0:15]
low_sales_df.Publisher = np.where(low_sales_df.Publisher.isin(top), low_sales_df.Publisher, 'other')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  low_sales_df.Publisher = np.where(low_sales_df.Publisher.isin(top), low_sales_df.Publisher, 'other')


In [76]:
# Keep top 15 of Developer_x
top = low_sales_df.Developer_x.value_counts().index[0:15]
low_sales_df.Developer_x = np.where(low_sales_df.Developer_x.isin(top), low_sales_df.Developer_x,'other')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  low_sales_df.Developer_x = np.where(low_sales_df.Developer_x.isin(top), low_sales_df.Developer_x,'other')


In [32]:
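# Keep top 15 of Year
# Note: this cell's execution count (32) predates the cells around it, and the
# nunique() output further down still reports 26 Year values, so it appears
# not to have been re-run in this session.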
top = low_sales_df.Year.value_counts().index[0:15]
low_sales_df.Year = np.where(low_sales_df.Year.isin(top), low_sales_df.Year, 'other')

In [77]:
low_sales_df

Unnamed: 0,Genre,ESRB_Rating,Platform,Publisher,Developer_x,Critic_Score,Year,Low_Sales_Class
2443,Shooter,M,PS3,Ubisoft,other,5.6,2008,1
2444,Racing,T,GC,other,other,8.2,2003,1
2445,Platform,T,PS,other,other,8.6,1998,1
2447,Racing,T,PS3,other,other,8.8,2009,1
2450,Adventure,T,DS,Capcom,Capcom,8.0,2008,1
...,...,...,...,...,...,...,...,...
19790,Fighting,T,other,other,other,8.4,2008,0
19792,Shooter,T,PC,Activision,other,7.0,2003,0
19794,Action,E,GBA,Atlus,Atlus Co.,6.0,2006,0
19800,Puzzle,E,GBA,other,other,6.7,2006,0


In [78]:
# Check unique values
low_sales_df.nunique()

Genre              16
ESRB_Rating         5
Platform           16
Publisher          16
Developer_x        16
Critic_Score       83
Year               26
Low_Sales_Class     2
dtype: int64
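
Each bucketed column now has 16 unique values: the 15 most frequent categories plus 'other'. The repeated bucketing cells above could be folded into one helper. The following is a sketch in which `keep_top_n` is a hypothetical name, not part of the original notebook:

```python
import pandas as pd

def keep_top_n(series, n=15, other_label='other'):
    """Keep the n most frequent values of a Series; replace the rest."""
    top = series.value_counts().index[:n]
    # Series.where keeps values where the condition holds, else substitutes.
    return series.where(series.isin(top), other_label)

# Toy example with n=2 so the effect is visible.
genres = pd.Series(['a', 'a', 'a', 'b', 'b', 'c', 'd'])
bucketed = keep_top_n(genres, n=2)
print(bucketed.tolist())
```

Applied to a DataFrame, this would be called once per column, for example `df[col] = keep_top_n(df[col])` inside a loop over the categorical column names.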

## Encoding categorical variables

In [81]:
# Assign features
X = low_sales_df.drop('Low_Sales_Class', axis = 1)
X

Unnamed: 0,Genre,ESRB_Rating,Platform,Publisher,Developer_x,Critic_Score,Year
2443,Shooter,M,PS3,Ubisoft,other,5.6,2008
2444,Racing,T,GC,other,other,8.2,2003
2445,Platform,T,PS,other,other,8.6,1998
2447,Racing,T,PS3,other,other,8.8,2009
2450,Adventure,T,DS,Capcom,Capcom,8.0,2008
...,...,...,...,...,...,...,...
19790,Fighting,T,other,other,other,8.4,2008
19792,Shooter,T,PC,Activision,other,7.0,2003
19794,Action,E,GBA,Atlus,Atlus Co.,6.0,2006
19800,Puzzle,E,GBA,other,other,6.7,2006


In [82]:
X.dtypes

Genre            object
ESRB_Rating      object
Platform         object
Publisher        object
Developer_x      object
Critic_Score    float64
Year             object
dtype: object

In [83]:
# List the object-dtype (categorical) columns to encode
X_cat = X.select_dtypes(include='object')
X_cat = list(X_cat.columns)
X_cat

['Genre', 'ESRB_Rating', 'Platform', 'Publisher', 'Developer_x', 'Year']

In [84]:
from sklearn.preprocessing import OneHotEncoder

# Create an instance of the one-hot encoder
# (scikit-learn >= 1.2; older releases use `sparse=False` instead)
enc = OneHotEncoder(sparse_output=False)
# Fit and transform the OneHotEncoder using the categorical variable list
encode_df = pd.DataFrame(enc.fit_transform(X[X_cat]))

# Add the encoded variable names to the dataframe
# (scikit-learn >= 1.0; older releases use `get_feature_names`)
encode_df.columns = enc.get_feature_names_out(X_cat)

encode_df



Unnamed: 0,Genre_Action,Genre_Action-Adventure,Genre_Adventure,Genre_Fighting,Genre_Misc,Genre_Music,Genre_Party,Genre_Platform,Genre_Puzzle,Genre_Racing,...,Year_2010,Year_2011,Year_2012,Year_2013,Year_2014,Year_2015,Year_2016,Year_2017,Year_2018,Year_2020
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3289,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3290,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3291,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3292,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
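
The encoded frame gets a fresh 0..3292 index, which is why `X` has to be re-indexed before the two frames can be merged. As an alternative sketch (not the approach used in this notebook), `pandas.get_dummies` encodes the listed columns directly, keeps the original row index, and leaves numeric columns untouched. The small frame below is illustrative only:

```python
import pandas as pd

# Illustrative rows with a non-contiguous index, as in low_sales_df.
X_small = pd.DataFrame(
    {'Genre': ['Shooter', 'Racing', 'Racing'],
     'Critic_Score': [5.6, 8.2, 8.8]},
    index=[2443, 2444, 2447],
)

# dtype=float mirrors the 0.0/1.0 values produced by OneHotEncoder.
encoded = pd.get_dummies(X_small, columns=['Genre'], dtype=float)
print(encoded)
```

One trade-off: `get_dummies` does not remember the category set, so for data that arrives later a fitted `OneHotEncoder` is usually the safer choice.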


In [85]:
# Reset X dataframe index to merge with encode_df
X.reset_index(drop=True, inplace=True)
X

Unnamed: 0,Genre,ESRB_Rating,Platform,Publisher,Developer_x,Critic_Score,Year
0,Shooter,M,PS3,Ubisoft,other,5.6,2008
1,Racing,T,GC,other,other,8.2,2003
2,Platform,T,PS,other,other,8.6,1998
3,Racing,T,PS3,other,other,8.8,2009
4,Adventure,T,DS,Capcom,Capcom,8.0,2008
...,...,...,...,...,...,...,...
3289,Fighting,T,other,other,other,8.4,2008
3290,Shooter,T,PC,Activision,other,7.0,2003
3291,Action,E,GBA,Atlus,Atlus Co.,6.0,2006
3292,Puzzle,E,GBA,other,other,6.7,2006


In [86]:
# Merge one-hot encoded features and drop the originals
X = X.merge(encode_df, left_index=True, right_index=True)
X = X.drop(columns=X_cat)
X


Unnamed: 0,Critic_Score,Genre_Action,Genre_Action-Adventure,Genre_Adventure,Genre_Fighting,Genre_Misc,Genre_Music,Genre_Party,Genre_Platform,Genre_Puzzle,...,Year_2010,Year_2011,Year_2012,Year_2013,Year_2014,Year_2015,Year_2016,Year_2017,Year_2018,Year_2020
0,5.6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,8.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,8.6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,8.8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,8.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3289,8.4,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3290,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3291,6.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3292,6.7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [87]:
# Assign the target
y = low_sales_df['Low_Sales_Class']
y.value_counts()

1    1902
0    1392
Name: Low_Sales_Class, dtype: int64
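
Dropping rows with NaNs changed the class balance: before the drop, class '0' was the majority (10931 vs 6489), while in the modeling set class '1' is (1902 vs 1392). The shares can be checked from the counts printed in this notebook:

```python
# Class counts copied from the value_counts() outputs in this notebook.
before = {'0': 10931, '1': 6489}   # all low-sales games
after = {'0': 1392, '1': 1902}     # after dropna()

def shares(counts):
    """Return each class's fraction of the total."""
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

for name, counts in [('before dropna', before), ('after dropna', after)]:
    print(name, {k: round(v, 3) for k, v in shares(counts).items()})
```

This matters when interpreting the results, since the models are trained on the subset of games that have a critic score and an ESRB rating, not on all low-sales games.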

In [88]:
print(X.shape)
print(y.shape)

(3294, 96)
(3294,)


## Splitting and scaling the data

In [89]:
# Split data to training and testing set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Check the balance of the target variables.
print(f"y_train: {Counter(y_train)}")
print(f"y_test: {Counter(y_test)}")

y_train: Counter({'1': 1428, '0': 1042})
y_test: Counter({'1': 474, '0': 350})


In [90]:
# Creating a StandardScaler instance.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Fitting the Standard Scaler with the training data.
X_scaler = scaler.fit(X_train)

# Scaling the data.
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
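
The scaler is fitted on the training split only, so the test split is transformed with the training mean and standard deviation and no information from the test set leaks into preprocessing. A NumPy sketch of the same arithmetic on toy numbers (`StandardScaler` uses the population standard deviation, i.e. `ddof=0`):

```python
import numpy as np

# Toy single-feature data, purely illustrative.
X_train_toy = np.array([[1.0], [2.0], [3.0]])
X_test_toy = np.array([[4.0]])

# Statistics come from the training data only.
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)  # ddof=0, matching StandardScaler

X_train_toy_scaled = (X_train_toy - mean) / std
X_test_toy_scaled = (X_test_toy - mean) / std
print(X_train_toy_scaled.ravel(), X_test_toy_scaled.ravel())
```

Tree-based models split on thresholds, so they are insensitive to this kind of per-feature rescaling; the scaled and unscaled inputs should give comparable forests.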

## Random Forest Classifier Model

In [99]:
# Create a random forest classifier.
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier(n_estimators=128, random_state=78) 

In [100]:
# Fitting the model
rf_model = rf_model.fit(X_train_scaled, y_train)

In [101]:
# Making predictions using the testing data.
y_pred_rf = rf_model.predict(X_test_scaled)

In [102]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
cm = confusion_matrix(y_test, y_pred_rf)

# Create a DataFrame from the confusion matrix.
#cm_df = pd.DataFrame(
#    cm, index=["Actual 0", "Actual 1"], columns=["Predicted 0", "Predicted 1"])

#cm_df
cm

array([[203, 147],
       [106, 368]])

In [103]:
# Calculating the accuracy score.
acc_score = accuracy_score(y_test, y_pred_rf)

In [107]:
# Displaying results
#print("Confusion Matrix")
#display(cm_df)
print('Model: Random Forest Classifier')
print("---------------------")
print(f"Accuracy Score : {acc_score}")
print("---------------------")
print("Classification Report")
print(classification_report(y_test, y_pred_rf))

Model: Random Forest Classifier
---------------------
Accuracy Score : 0.6929611650485437
---------------------
Classification Report
              precision    recall  f1-score   support

           0       0.66      0.58      0.62       350
           1       0.71      0.78      0.74       474

    accuracy                           0.69       824
   macro avg       0.69      0.68      0.68       824
weighted avg       0.69      0.69      0.69       824
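
The accuracy and the per-class precision and recall in the report can be recovered directly from the confusion matrix printed earlier. The sketch below also derives the balanced accuracy (the mean of the per-class recalls), which is the metric reported for the two imbalanced-learn models. The row and column labels are an assumption based on the sorted class order '0', '1':

```python
import numpy as np
import pandas as pd

# Confusion matrix copied from the output above (rows: actual, cols: predicted).
cm = np.array([[203, 147],
               [106, 368]])

cm_df = pd.DataFrame(cm,
                     index=['Actual 0', 'Actual 1'],
                     columns=['Predicted 0', 'Predicted 1'])

accuracy = np.trace(cm) / cm.sum()          # correct predictions / all
recall = np.diag(cm) / cm.sum(axis=1)       # per actual class
precision = np.diag(cm) / cm.sum(axis=0)    # per predicted class
balanced_accuracy = recall.mean()

print(cm_df)
print(f"accuracy={accuracy:.4f}, balanced accuracy={balanced_accuracy:.4f}")
print("precision:", precision.round(2), "recall:", recall.round(2))
```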



In [114]:
# Calculate feature importance in the Random Forest model.
print("Feature Importance: rf model:")
sorted(zip(rf_model.feature_importances_, X.columns), reverse=True)

Feature Importance: rf model:


[(0.19881627502677116, 'Critic_Score'),
 (0.03644248562263341, 'Publisher_other'),
 (0.024504970171840723, 'Platform_PC'),
 (0.023013425057149782, 'ESRB_Rating_T'),
 (0.021854482769481653, 'Platform_PS3'),
 (0.021737330964840654, 'Genre_Action'),
 (0.01915391672620889, 'ESRB_Rating_E'),
 (0.018347560939602398, 'Platform_GBA'),
 (0.01731793113515042, 'Platform_X360'),
 (0.01677117288296144, 'ESRB_Rating_M'),
 (0.016650588544215344, 'ESRB_Rating_E10'),
 (0.016604162722312542, 'Genre_Role-Playing'),
 (0.016526689194358166, 'Year_2006'),
 (0.015971755510082265, 'Developer_x_other'),
 (0.015848775176603873, 'Year_2009'),
 (0.014751893630041292, 'Genre_Platform'),
 (0.014449389234470197, 'Year_2011'),
 (0.014269121703612555, 'Genre_Shooter'),
 (0.01379571612413485, 'Platform_PSP'),
 (0.013713794392244745, 'Year_2008'),
 (0.01344679936336679, 'Year_2007'),
 (0.013351327429948413, 'Platform_GC'),
 (0.01299745301716718, 'Genre_Sports'),
 (0.012804171676842164, 'Year_2005'),
 (0.0124160338858860
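
Because the categorical features are one-hot encoded, their importance is spread across many dummy columns. To compare original features with the first model, one option is to sum the importances per parent column. This is a sketch: `aggregate_importances` is a hypothetical helper, and only the first five entries of the listing above are used, so the totals are partial:

```python
def aggregate_importances(importances, columns, categorical):
    """Sum one-hot importances back onto their parent feature."""
    totals = {}
    for value, col in zip(importances, columns):
        # A dummy column is named '<parent>_<category>'.
        parent = next((c for c in categorical if col.startswith(c + '_')), col)
        totals[parent] = totals.get(parent, 0.0) + value
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

categorical = ['Genre', 'ESRB_Rating', 'Platform', 'Publisher',
               'Developer_x', 'Year']

# First five (importance, column) pairs from the rf_model output above.
pairs = [
    (0.19881627502677116, 'Critic_Score'),
    (0.03644248562263341, 'Publisher_other'),
    (0.024504970171840723, 'Platform_PC'),
    (0.023013425057149782, 'ESRB_Rating_T'),
    (0.021854482769481653, 'Platform_PS3'),
]
values, names = zip(*pairs)
ranked = aggregate_importances(values, names, categorical)
print(ranked)
```

In the notebook itself the call would take `rf_model.feature_importances_` and `X.columns` as the first two arguments.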

## Balanced Random Forest Classifier Model

In [105]:
# Train a BalancedRandomForestClassifier, which undersamples each bootstrap sample internally
from imblearn.ensemble import BalancedRandomForestClassifier

brf_model = BalancedRandomForestClassifier(n_estimators=128, random_state = 78) 

# Fitting the model
brf_model.fit(X_train, y_train)

In [108]:
# Calculate the balanced accuracy score
y_pred_brf = brf_model.predict(X_test)

from sklearn.metrics import balanced_accuracy_score
brf_acc_score = balanced_accuracy_score(y_test, y_pred_brf)

In [109]:
# Display the confusion matrix
#from sklearn.metrics import confusion_matrix
#pd.DataFrame(
#    confusion_matrix(y_test, y_pred_brf),
#    index=["Actual 0", "Actual 1"],
#    columns=["Predicted 0", "Predicted 1"])

In [110]:
# Print the imbalanced classification report
# (the "Accuracy Score" printed here is the balanced accuracy)
print('Model: Balanced Random Forest Classifier')
print("---------------------")
print(f"Accuracy Score : {brf_acc_score}")
print("---------------------")
print("Classification Report")
print(classification_report_imbalanced(y_test, y_pred_brf))

Model: Balanced Random Forest Classifier
---------------------
Accuracy Score : 0.6926522001205546
---------------------
Classification Report
                   pre       rec       spe        f1       geo       iba       sup

          0       0.61      0.72      0.66      0.66      0.69      0.48       350
          1       0.76      0.66      0.72      0.71      0.69      0.48       474

avg / total       0.70      0.69      0.70      0.69      0.69      0.48       824
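
In this report, `spe` is specificity and `geo` is the geometric mean of recall and specificity. For a two-class problem, the specificity of one class equals the recall of the other, so the `geo` column and the balanced accuracy can be approximated from the rounded recalls printed above. A rough sketch using those rounded values:

```python
import math

# Rounded recalls from the report above.
recall = {'0': 0.72, '1': 0.66}

# Binary case: specificity of a class is the recall of the other class.
specificity = {'0': recall['1'], '1': recall['0']}

geo = {c: math.sqrt(recall[c] * specificity[c]) for c in recall}
balanced_accuracy = sum(recall.values()) / len(recall)

print({c: round(g, 2) for c, g in geo.items()})
print(round(balanced_accuracy, 2))
```

The result agrees with the printed balanced accuracy of about 0.69 to two decimals; the small difference comes from using rounded inputs.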



In [115]:
# Calculate feature importance in the Balanced Random Forest model.
print("Feature Importance: brf model: ")
sorted(zip(brf_model.feature_importances_, X.columns), reverse=True)

Feature Importance: brf model: 


[(0.18943791648509622, 'Critic_Score'),
 (0.035435375649119, 'Publisher_other'),
 (0.024837281295867326, 'Platform_PS3'),
 (0.024233858405716485, 'Platform_PC'),
 (0.02271372613527671, 'ESRB_Rating_T'),
 (0.02162921557931814, 'Genre_Action'),
 (0.019010070010919113, 'ESRB_Rating_E'),
 (0.017917876951522062, 'Platform_X360'),
 (0.017301331047102658, 'Platform_GBA'),
 (0.01702822744588775, 'ESRB_Rating_M'),
 (0.016159631074113372, 'Developer_x_other'),
 (0.015860891197522, 'Year_2009'),
 (0.015649791689759086, 'Year_2006'),
 (0.015591922499179544, 'Genre_Role-Playing'),
 (0.015568662872843217, 'ESRB_Rating_E10'),
 (0.01483710965209902, 'Genre_Shooter'),
 (0.014601241355993602, 'Genre_Platform'),
 (0.014428462694830576, 'Year_2011'),
 (0.013628461156209018, 'Platform_PSP'),
 (0.013625749730586606, 'Genre_Sports'),
 (0.013451352858002072, 'Year_2008'),
 (0.013291639736824241, 'Platform_GC'),
 (0.012953637179987038, 'Year_2005'),
 (0.012824416407813845, 'Genre_Racing'),
 (0.0127779637049791

## Easy Ensemble AdaBoost Classifier Model

In [111]:
# Train the EasyEnsembleClassifier
from imblearn.ensemble import EasyEnsembleClassifier 

eec_model = EasyEnsembleClassifier(n_estimators=128, random_state=78)

eec_model.fit(X_train, y_train)

In [112]:
# Calculate the balanced accuracy score
y_pred_eec = eec_model.predict(X_test)

eec_acc_score = balanced_accuracy_score(y_test, y_pred_eec)

In [59]:
# Display the confusion matrix
#pd.DataFrame(
#    confusion_matrix(y_test, y_pred_eec),
#    index=["Actual 0", "Actual 1"],
#    columns=["Predicted 0", "Predicted 1"])

In [113]:
# Print the imbalanced classification report
# (the "Accuracy Score" printed here is the balanced accuracy)
print('Model: EasyEnsembleClassifier')
print("---------------------")
print(f"Accuracy Score : {eec_acc_score}")
print("---------------------")
print("Classification Report")
print(classification_report_imbalanced(y_test, y_pred_eec))

Model: EasyEnsembleClassifier
---------------------
Accuracy Score : 0.7048704038577456
---------------------
Classification Report
                   pre       rec       spe        f1       geo       iba       sup

          0       0.64      0.71      0.70      0.67      0.70      0.50       350
          1       0.77      0.70      0.71      0.73      0.70      0.50       474

avg / total       0.71      0.70      0.71      0.71      0.70      0.50       824
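
The three scores printed in this notebook are not all the same metric: the random forest cell reports plain accuracy, while the other two report balanced accuracy. To put them on one scale, the sketch below derives the random forest's balanced accuracy from its confusion matrix ([[203, 147], [106, 368]]) and collects all three in a small table. The table layout is illustrative, not part of the original analysis:

```python
import pandas as pd

# Per-class recall for rf_model, from its confusion matrix rows.
rf_recalls = (203 / (203 + 147), 368 / (106 + 368))
rf_balanced = sum(rf_recalls) / len(rf_recalls)

summary = pd.DataFrame({
    'model': ['rf_model', 'brf_model', 'eec_model'],
    # brf/eec values copied from the printed outputs above.
    'balanced_accuracy': [rf_balanced,
                          0.6926522001205546,
                          0.7048704038577456],
}).sort_values('balanced_accuracy', ascending=False, ignore_index=True)

print(summary)
```

On this metric the ranking is `eec_model`, then `brf_model`, then `rf_model`, although the gaps are small for a single train/test split.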

