# <p style="text-align: center;">Part B: Classification (Supervised Learning)</p>

In [1]:
# Imports
import numpy as np
import pandas as pd
from scipy.stats import randint
from sklearn import metrics  # scikit-learn metrics module for accuracy calculation
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    accuracy_score,
    classification_report,
    confusion_matrix,
    precision_score,
    recall_score,
)
from sklearn.model_selection import KFold, RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# 1.0 Classification Using Decision Tree Model

### Loading Data

In [2]:
col_names = ['rank', 'name', 'platform', 'year', 'genre', 'publisher', 'na_sales', 'eu_sales', 'jp_sales', 'other_sales', 'global_sales']
# load dataset
sales = pd.read_csv("./assets/sales_table.csv", header=None, names=col_names)
# Removing the first row since it has duplicate column names.
sales = sales.tail(-1)
sales.head()

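|     | rank  | name                    | platform | year | genre    | publisher      | na_sales | eu_sales | jp_sales | other_sales | global_sales |
|-----|-------|-------------------------|----------|------|----------|----------------|----------|----------|----------|-------------|--------------|
| 0.0 | 14691 | Cossacks: European Wars | PC       | 2001 | Strategy | Strategy First | 0.0      | 0.02     | 0.0      | 0.0         | 0.03         |
| 1.0 | 61    | Just Dance 3            | Wii      | 2011 | Misc     | Ubisoft        | 6.05     | 3.15     | 0.0      | 1.07        | 10.26        |
| 2.0 | 69    | Just Dance 2            | Wii      | 2010 | Misc     | Ubisoft        | 5.84     | 2.89     | 0.01     | 0.78        | 9.52         |
| 3.0 | 103   | Just Dance              | Wii      | 2009 | Misc     | Ubisoft        | 3.51     | 3.03     | 0.0      | 0.73        | 7.27         |
| 4.0 | 112   | Just Dance 4            | Wii      | 2012 | Misc     | Ubisoft        | 4.14     | 2.21     | 0.0      | 0.56        | 6.91         |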


In [3]:
# Converting attributes from string to a more relevant data type.
sales = sales.astype({
    'rank': int,
    'year': int,
    'na_sales': float,
    'eu_sales': float,
    'jp_sales': float,
    'other_sales': float,
    'global_sales': float,
})
sales.head()

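|     | rank  | name                    | platform | year | genre    | publisher      | na_sales | eu_sales | jp_sales | other_sales | global_sales |
|-----|-------|-------------------------|----------|------|----------|----------------|----------|----------|----------|-------------|--------------|
| 0.0 | 14691 | Cossacks: European Wars | PC       | 2001 | Strategy | Strategy First | 0.0      | 0.02     | 0.0      | 0.0         | 0.03         |
| 1.0 | 61    | Just Dance 3            | Wii      | 2011 | Misc     | Ubisoft        | 6.05     | 3.15     | 0.0      | 1.07        | 10.26        |
| 2.0 | 69    | Just Dance 2            | Wii      | 2010 | Misc     | Ubisoft        | 5.84     | 2.89     | 0.01     | 0.78        | 9.52         |
| 3.0 | 103   | Just Dance              | Wii      | 2009 | Misc     | Ubisoft        | 3.51     | 3.03     | 0.0      | 0.73        | 7.27         |
| 4.0 | 112   | Just Dance 4            | Wii      | 2012 | Misc     | Ubisoft        | 4.14     | 2.21     | 0.0      | 0.56        | 6.91         |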


In [4]:
sales.info()

<class 'pandas.core.frame.DataFrame'>
Float64Index: 2338 entries, 0.0 to 2337.0
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   rank          2338 non-null   int64  
 1   name          2338 non-null   object 
 2   platform      2338 non-null   object 
 3   year          2338 non-null   int64  
 4   genre         2338 non-null   object 
 5   publisher     2338 non-null   object 
 6   na_sales      2338 non-null   float64
 7   eu_sales      2338 non-null   float64
 8   jp_sales      2338 non-null   float64
 9   other_sales   2338 non-null   float64
 10  global_sales  2338 non-null   float64
dtypes: float64(5), int64(2), object(4)
memory usage: 219.2+ KB


In [5]:
# Encoding the platform, publisher and genre attributes as integer codes for training the model.
sales.platform = pd.Categorical(pd.factorize(sales.platform)[0])
sales.publisher = pd.Categorical(pd.factorize(sales.publisher)[0])
sales.genre = pd.Categorical(pd.factorize(sales.genre)[0])
sales.info()

<class 'pandas.core.frame.DataFrame'>
Float64Index: 2338 entries, 0.0 to 2337.0
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   rank          2338 non-null   int64   
 1   name          2338 non-null   object  
 2   platform      2338 non-null   category
 3   year          2338 non-null   int64   
 4   genre         2338 non-null   category
 5   publisher     2338 non-null   category
 6   na_sales      2338 non-null   float64 
 7   eu_sales      2338 non-null   float64 
 8   jp_sales      2338 non-null   float64 
 9   other_sales   2338 non-null   float64 
 10  global_sales  2338 non-null   float64 
dtypes: category(3), float64(5), int64(2), object(1)
memory usage: 173.0+ KB


### Feature Selection

- Here, we divide the columns into two types of variables: the dependent (target) variable and the independent (feature) variables.

- For our target variable, we choose the platform attribute.


In [6]:
#split dataset in features and target variable
feature_cols = ['rank', 'publisher','year', 'genre', 'na_sales', 'eu_sales', 'jp_sales', 'other_sales', 'global_sales']
X = sales[feature_cols] # Features
y = sales.platform # Target variable

### Splitting the Data

- Here, we split the dataset into a training set and a test set: 70% of the rows are used for training and the remaining 30% for testing.

In [7]:
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) # 70% training and 30% test
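As a quick arithmetic check of what a 70/30 split means for this data set, the sketch below works out the row counts from the 2338 rows reported by `sales.info()`. It assumes that a fractional `test_size` is rounded up to a whole number of rows, and it uses the Decision Tree accuracy reported further down only as an example of converting a score back into a count of correct predictions.

```python
import math

n_samples = 2338   # row count reported by sales.info() above
test_size = 0.3    # same value passed to train_test_split

# Assumption: a fractional test_size is rounded up to a whole number of rows.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)

# An accuracy on this test set is (correct predictions) / n_test,
# so a reported score can be turned back into a count.
reported_accuracy = 0.452991452991453  # Decision Tree output in this notebook
n_correct = round(reported_accuracy * n_test)
print(n_correct, n_correct / n_test)
```

With a test set of this size, one extra correct prediction moves the accuracy by roughly 0.0014, which is useful context when comparing models whose scores differ only in the second decimal place.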

### Building Decision Tree Model

In [8]:
# Create Decision Tree classifier object
clf = DecisionTreeClassifier()

# Train Decision Tree classifier
clf = clf.fit(X_train, y_train)

# Predict the response for the test dataset
y_pred = clf.predict(X_test)

### Evaluation of the Decision Tree Model

In [9]:
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
pre = precision_score(y_test, y_pred, average='micro')
re = recall_score(y_test, y_pred, average='micro')
print("Precision:", pre)
print("Recall:", re)

Accuracy: 0.452991452991453
Precision: 0.452991452991453
Recall: 0.452991452991453


# 2.0 Classification Using Gradient Boosting

### Building Gradient Boosting Model

In [10]:
gradient_booster = GradientBoostingClassifier(learning_rate=0.1)
gradient_booster.fit(X_train,y_train)


### Evaluation of the Gradient Boosting Model

In [11]:
y_pred = gradient_booster.predict(X_test)
acc = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='micro')
recall = recall_score(y_test, y_pred, average='micro')
print("Accuracy:", acc)
print("Precision:", precision)
print("Recall:", recall)

Accuracy: 0.5783475783475783
Precision: 0.5783475783475783
Recall: 0.5783475783475783


# 3.0 Classification Using Random Forest Algorithm

### Building Random Forest Algorithm Model

In [12]:
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

### Evaluation of the Random Forest Model

In [13]:
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='micro')
recall = recall_score(y_test, y_pred, average='micro')
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)

Accuracy: 0.5612535612535613
Precision: 0.5612535612535613
Recall: 0.5612535612535613


# 4.0 Comparison and Discussion

| Score Measure | Decision Tree | Gradient Boost Model | Random Forest Algorithm |
|---------------|---------------|----------------------|-------------------------|
| Accuracy      | 0.4387        | 0.5556               | 0.5613                  |
| Precision     | 0.4387        | 0.5556               | 0.5613                  |
| Recall        | 0.4387        | 0.5556               | 0.5613                  |

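The three models trained on the game sales data set are the Decision Tree model, Gradient Boost model, and Random Forest algorithm. Their scores are 0.4387, 0.5556, and 0.5613 respectively, and within each model the accuracy, precision, and recall are identical. The Decision Tree and Gradient Boost values in the table differ from the cell outputs above (0.4530 and 0.5783), most likely because the models were fitted without a fixed `random_state` and the table records a different run.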

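Accuracy is the fraction of test samples whose class label is predicted correctly. Precision is the fraction of the predictions for a class that are correct, and recall is the fraction of the actual members of a class that the model identifies. Because the scores here are micro-averaged (`average='micro'`) over a single-label multi-class problem, every misclassification counts once as a false positive and once as a false negative, so precision and recall both reduce to accuracy; this is why the three rows of the table are identical. Macro-averaged scores (`average='macro'`) would expose per-class differences.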
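The following is a minimal pure-Python sketch of how micro-averaged precision and recall relate to accuracy when every sample has exactly one label. The platform labels are made up for illustration and are not taken from the sales data; `micro_scores` and `macro_precision` are hypothetical helper names.

```python
def micro_scores(y_true, y_pred):
    """Return (accuracy, micro precision, micro recall) from pooled counts."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1          # predicted this class and was right
            elif p == label and t != label:
                fp += 1          # predicted this class but was wrong
            elif p != label and t == label:
                fn += 1          # missed a true member of this class
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return accuracy, tp / (tp + fp), tp / (tp + fn)


def macro_precision(y_true, y_pred):
    """Unweighted mean of per-class precision (classes never predicted are skipped)."""
    per_class = []
    for label in set(y_pred):
        predicted = [t for t, p in zip(y_true, y_pred) if p == label]
        per_class.append(sum(t == label for t in predicted) / len(predicted))
    return sum(per_class) / len(per_class)


# Illustrative labels only
y_true = ["Wii", "PC", "PS2", "Wii", "PC", "PS2", "Wii", "Wii"]
y_pred = ["Wii", "Wii", "PS2", "Wii", "PC", "PC", "Wii", "PS2"]

acc, micro_p, micro_r = micro_scores(y_true, y_pred)
print(acc, micro_p, micro_r)            # 0.625 0.625 0.625
print(macro_precision(y_true, y_pred))  # differs from the micro average
```

Each wrong prediction adds one false positive (for the predicted class) and one false negative (for the true class), so the pooled denominators both equal the number of samples.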

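From the tabulated scores, the Random Forest algorithm performs best on all three metrics, narrowly ahead of the Gradient Boost model (0.5613 versus 0.5556), while the Decision Tree model scores lowest by a clear margin. In the cell outputs shown earlier, the Gradient Boost model (0.5783) is ahead of the Random Forest (0.5613), so the gap between the two ensembles is small enough to change from run to run and should not be read as a firm ranking.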

The Random Forest algorithm is an ensemble learning method that combines multiple decision trees to improve the accuracy of the model. This is achieved by randomly selecting subsets of features and samples to create each tree, reducing overfitting and improving the generalization ability of the model.
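The idea of sampling rows and features for each tree can be sketched by hand. The example below is an illustration only, not how `RandomForestClassifier` is implemented internally: it uses synthetic data from `make_classification` as a stand-in for the sales features, draws a bootstrap sample of rows per tree, lets each tree consider a random subset of features at every split (`max_features="sqrt"`), and combines trees by majority vote. The helper names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption: 3 classes, 8 numeric features)
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)


def fit_bagged_trees(X, y, n_trees=25):
    """Fit each tree on a bootstrap sample with random feature subsets per split."""
    trees = []
    n = len(X)
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)  # sample rows with replacement
        tree = DecisionTreeClassifier(
            max_features="sqrt",
            random_state=int(rng.integers(1_000_000)),
        )
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees


def predict_majority(trees, X):
    """Combine the trees by taking the most common predicted label per sample."""
    votes = np.stack([t.predict(X) for t in trees])  # shape (n_trees, n_samples)
    return np.array([np.bincount(col).argmax() for col in votes.T])


trees = fit_bagged_trees(X, y)
preds = predict_majority(trees, X)
print("training accuracy:", (preds == y).mean())
```

Because each tree sees a different sample and different candidate features, their individual errors are less correlated, which is what lets the vote generalize better than a single tree.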

The Gradient Boost model is also an ensemble learning method; it combines many weak models into a strong one. It builds trees in a stage-wise manner, with each new tree fitted to correct the errors made by the trees before it. Because the trees are built sequentially, training cannot be parallelised across trees as it can in a Random Forest, so it can be more computationally expensive, but it can also be more accurate.
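The stage-wise idea can be illustrated with a deliberately simplified sketch. It assumes a one-dimensional regression problem with squared-error loss and one-split "stumps" as weak learners, whereas `GradientBoostingClassifier` fits trees to gradients of a classification loss; the data and the helpers `fit_stump` and `boost` are invented for illustration.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_stages=20, learning_rate=0.1):
    """Add one stump per stage, each fitted to the current residuals."""
    base = sum(ys) / len(ys)            # stage 0: predict the mean
    preds = [base] * len(xs)
    stumps, history = [], []
    for _ in range(n_stages):
        residuals = [y - p for y, p in zip(ys, preds)]  # errors so far
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, history


# Made-up data with three rough plateaus
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.8, 5.2, 5.0]

base, stumps, history = boost(xs, ys)
print("MSE after first stage:", history[0])
print("MSE after last stage: ", history[-1])
```

The `learning_rate` shrinks each stump's contribution, so the training error falls gradually across stages instead of being fitted in a single step.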

The Decision Tree model is a simple and interpretable model that creates a tree-like structure to classify data based on a set of rules. However, it may be prone to overfitting and may not generalize well to new data.

In conclusion, the Random Forest algorithm and Gradient Boost model outperform the Decision Tree model in terms of accuracy, precision, and recall. However, the choice of model ultimately depends on the specific needs and constraints of the problem at hand, such as the size of the data set and the computational resources available.

# 5.0 Actionable Items

- Feature engineering: Creating new features from the existing data can increase the models' predictive capacity. We can raise accuracy by finding and developing features that are more relevant to predicting the platform.

- Hyperparameter tuning: Hyperparameters are settings fixed before a model is trained, and they can strongly influence its performance. Adjusting them can improve both the accuracy and the generalization capacity of the model. For the Gradient Boost model or Random Forest algorithm, for instance, changing the number of trees, their depth, or (for Gradient Boost) the learning rate might improve performance.

- Increasing the size of the data set: The accuracy of the models can be improved by training on more data. This can be done by gathering more data, applying data augmentation, or integrating additional data sets.
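As one possible way to act on the hyperparameter item, the sketch below runs a small randomized search over the number of trees and the tree depth of a Random Forest, using `RandomizedSearchCV` and `randint`. The data is a synthetic stand-in from `make_classification`, and the search ranges, `n_iter`, and `cv` values are arbitrary assumptions chosen to keep the example fast; with the notebook's data, `X_train` and `y_train` from the split above would be used instead.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the sales features (assumption: 9 features, 3 classes)
X, y = make_classification(n_samples=400, n_features=9, n_informative=5,
                           n_classes=3, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

# Illustrative search ranges, not tuned recommendations
param_dist = {
    "n_estimators": randint(50, 200),
    "max_depth": randint(3, 15),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=1),
    param_distributions=param_dist,
    n_iter=5,        # number of random parameter combinations to try
    cv=3,            # 3-fold cross-validation on the training set only
    random_state=1,
)
search.fit(X_train, y_train)

best_rf = search.best_estimator_
test_score = best_rf.score(X_test, y_test)
print("Best parameters:", search.best_params_)
print("Test accuracy:", test_score)
```

Fixing `random_state` on both the model and the search also makes the reported scores reproducible from run to run.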

By taking these actions, we can improve the performance of the models and make them more accurate and robust.