#Table of Contents
##1. Business Background & Objective
##2. Data Collection: Importing Libraries and Reading Dataset
##3. Data Preparation: Data Cleaning
##4. Pipeline-based workflow
*   Data Preparation: Data Pre-processing and Transforming the data.
*   Model Selection: Using Stacking Model with 3 Base models and a meta regressor model.
*   Training the Data: fitting the model to the training data.

##5. Model Evaluation
*   Evaluating the R2 score using the score method.
*   Making predictions on test dataset.
*   Evaluating the accuracy of the model using R2, MAE, MSE, RMSE, and MAPE metrics.
*   Cross-Validation Evaluation Techniques - K-fold, Leave-one-out, and Shuffle Split methods.

##6. Hyperparameter Tuning using GridSearch Cross Validation Method with Pipeline
##7. Business Deployment.



##1. Business Background & Objective
####Business Background: IMAY Manufacturing Co. is a well-established organization that specializes in manufacturing war board games. The company has been making war board games for 30 years, and its games are well received by the gaming community.

####Business Objective: The company is planning to develop and manufacture a new war board game. Before development begins, it wants to estimate the game's BGG (BoardGameGeek) ranking, which serves as a measure of the game's popularity.

Determining the BGG ranking of the game before development can provide the company with an idea about the popularity and acceptance of the game by the gaming community. So, if the game is expected to have a high BGG ranking, it would mean that the game has a good chance of being successful in the market, and the company can go ahead with development of the game.

##2. Data Collection: Importing Libraries and Reading the CSV dataset

Importing necessary libraries to read dataset, including pandas, seaborn, numpy.

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np

Reading dataset from War Domain CSV file and storing it in Pandas DataFrame df_game. 

In [2]:
df_game = pd.read_csv('War.csv', na_values='NA') #No need to add index_col=0 as ID of games are more than just a serial number, and we will be dropping that column later on

##3. Data Preparation: Data Cleaning

Check the data types of the columns so that they can be cleaned appropriately.

In [3]:
df_game.info() #Displays the basic information about the War Game DataFrame like column names, number of non-null values, and data types.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3029 entries, 0 to 3028
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   ID                 3029 non-null   int64  
 1   Name               3029 non-null   object 
 2   YearPublished      3020 non-null   float64
 3   MinPlayers         3029 non-null   int64  
 4   MaxPlayers         3029 non-null   int64  
 5   PlayTime           3012 non-null   float64
 6   MinAge             3029 non-null   int64  
 7   UsersRated         3029 non-null   int64  
 8   RatingAverage      2800 non-null   object 
 9   BGGRank            3029 non-null   int64  
 10  ComplexityAverage  3029 non-null   object 
 11  OwnedUsers         3029 non-null   int64  
 12  Mechanics          3005 non-null   object 
 13  Domains            3029 non-null   object 
dtypes: float64(2), int64(7), object(5)
memory usage: 331.4+ KB


Clean up the object-typed values in the ComplexityAverage column so that the machine learning model receives numeric data. Note that this step deletes the comma instead of treating it as a decimal separator, so a raw value such as 2,38 would be stored as the integer 238.

In [4]:
df_game['ComplexityAverage'] = df_game['ComplexityAverage'].str.replace(',', '').astype(float).round().astype(int) #Removes the commas, converts the values to float, rounds them, and casts the result to integer.

Clean up the object-typed values in the RatingAverage column so that the machine learning model receives numeric data.

In [5]:
df_game['RatingAverage'] = pd.to_numeric(df_game['RatingAverage'].str.replace(',', '.'), errors='coerce') #Replaces comma separator with a period and then converts it to numeric data type. If any error occurs, it is coerced (converted) to NaN.
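The difference between the two cleaning steps can be seen on a few made-up strings. This is a minimal sketch; the sample values are hypothetical and are not taken from War.csv.

```python
import pandas as pd

# Hypothetical raw values that use a comma as the decimal separator.
raw = pd.Series(['6,40', '4,68', 'n/a', None])

# Replacing the comma with a period keeps the decimal value;
# anything that still cannot be parsed becomes NaN.
as_decimal = pd.to_numeric(raw.str.replace(',', '.', regex=False), errors='coerce')

# Deleting the comma instead yields a number 100 times larger for two-decimal values.
without_comma = pd.to_numeric(raw.str.replace(',', '', regex=False), errors='coerce')

print(as_decimal.tolist())
print(without_comma.tolist())
```

Both variants turn unparseable entries into NaN, which the imputer in the pipeline can then fill.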

Cross-check that all columns with numerical values now have the int64 or float64 data type.

In [6]:
df_game.info() #Displays the basic information about the War Game DataFrame like column names, number of non-null values, and data types.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3029 entries, 0 to 3028
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   ID                 3029 non-null   int64  
 1   Name               3029 non-null   object 
 2   YearPublished      3020 non-null   float64
 3   MinPlayers         3029 non-null   int64  
 4   MaxPlayers         3029 non-null   int64  
 5   PlayTime           3012 non-null   float64
 6   MinAge             3029 non-null   int64  
 7   UsersRated         3029 non-null   int64  
 8   RatingAverage      2800 non-null   float64
 9   BGGRank            3029 non-null   int64  
 10  ComplexityAverage  3029 non-null   int64  
 11  OwnedUsers         3029 non-null   int64  
 12  Mechanics          3005 non-null   object 
 13  Domains            3029 non-null   object 
dtypes: float64(3), int64(8), object(3)
memory usage: 331.4+ KB


Check whether the DataFrame has any categorical columns.

In [7]:
df_game.describe(include='all') #Displays descriptive statistics about the DataFrame including count, mean, standard deviation, minimum, maximum, and quartiles for numeric columns, and count, unique values, top, and frequency for categorical columns.

Unnamed: 0,ID,Name,YearPublished,MinPlayers,MaxPlayers,PlayTime,MinAge,UsersRated,RatingAverage,BGGRank,ComplexityAverage,OwnedUsers,Mechanics,Domains
count,3029.0,3029,3020.0,3029.0,3029.0,3012.0,3029.0,3029.0,2800.0,3029.0,3029.0,3029.0,3005,3029
unique,,3000,,,,,,,,,,,1057,1
top,,Gettysburg,,,,,,,,,,,Hexagon Grid,Wargames
freq,,4,,,,,,,,,,,400,3029
mean,50573.468141,,1995.741391,1.771212,3.254539,306.586653,10.104985,217.338065,6.875493,9630.068339,263.443381,569.16375,,
std,72396.255714,,74.159419,0.495171,6.657371,1377.313381,4.525672,681.232342,0.83692,4586.445739,105.77987,1052.202777,,
min,27.0,,0.0,0.0,0.0,0.0,0.0,30.0,2.85,140.0,0.0,38.0,,
25%,5960.0,,1986.0,2.0,2.0,90.0,10.0,48.0,6.31,6030.0,230.0,196.0,,
50%,12282.0,,2001.0,2.0,2.0,150.0,12.0,81.0,6.9,9508.0,275.0,307.0,,
75%,63951.0,,2010.0,2.0,3.0,240.0,12.0,171.0,7.46,13009.0,329.0,573.0,,


Check the number of missing values in each column

In [8]:
df_game.isnull().sum() #Displays the number of null values in each column of the DataFrame.

ID                     0
Name                   0
YearPublished          9
MinPlayers             0
MaxPlayers             0
PlayTime              17
MinAge                 0
UsersRated             0
RatingAverage        229
BGGRank                0
ComplexityAverage      0
OwnedUsers             0
Mechanics             24
Domains                0
dtype: int64

##4. Pipeline Based Workflow

Separate the dependent and independent variables:
#####Independent Variables: the features used to predict the dependent variable. The ID, Name, Mechanics, and YearPublished columns are dropped because they are specific to each game and have little impact on the BGG rank. The Domains column is dropped because the company makes games in only one domain.
#####Dependent Variable: the value that is predicted from the independent variables, here BGGRank.

In [9]:
game_y = df_game['BGGRank']
game_x = df_game.drop(columns=['ID','Mechanics','Name','Domains','YearPublished','BGGRank'])

Cross-check that the feature set contains the columns needed for the machine learning model; if it does not, edit the code above accordingly.

In [10]:
game_x.head()

Unnamed: 0,MinPlayers,MaxPlayers,PlayTime,MinAge,UsersRated,RatingAverage,ComplexityAverage,OwnedUsers
0,2,2,15.0,10,234,6.4,146,337
1,1,2,180.0,12,51,4.68,238,170
2,1,2,180.0,12,59,5.39,307,270
3,2,2,360.0,12,258,5.79,362,785
4,2,2,120.0,12,117,6.43,238,348


Split the dataset into training and testing sets in a 70/30 ratio. A fixed random_state (42 here) makes the split reproducible. The particular value has no major impact on the result, but it has to be kept constant across the script.

In [11]:
from sklearn.model_selection import train_test_split  
X_train, X_test, Y_train, Y_test = train_test_split(
    game_x, game_y, test_size=0.3, random_state=42)  #30% test split will give enough data for model to test upon.
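The sizes of the two sets and the effect of a fixed random_state can be checked without the real data. The sketch below is only an illustration: it uses an array of 3029 row positions as a stand-in for the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 3029 rows of the dataset (row positions only, no real features).
rows = np.arange(3029)

train_a, test_a = train_test_split(rows, test_size=0.3, random_state=42)
train_b, test_b = train_test_split(rows, test_size=0.3, random_state=42)

# Number of rows that end up in each set.
print(len(train_a), len(test_a))

# The same random_state reproduces exactly the same split.
print(np.array_equal(test_a, test_b))
```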

Create a pipeline that preprocesses the data, applies the stacking regressor model, and trains on the data. Regression models are used when the target is a continuous number. Classifiers, as taught in class, assign each observation to one of a set of predetermined classes. For example, to determine the species of an iris flower from its sepal and petal length and width, a classifier model would be used.

#####A base model is trained on a dataset to make predictions on new data. Each of the base models (random forest, gradient boosting, and support vector regression) has its own advantages and limitations; combining them in a stacking model pools their strengths to give a more accurate result.
#####A meta model is trained on the outputs of the base models to improve on their individual performance by combining their strengths and compensating for their weaknesses. Linear regression is used as the meta model to keep the final model interpretable.
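The idea behind stacking can be sketched by hand: each base model produces out-of-fold predictions, and those predictions become the input features of the meta model. The example below is a simplified illustration on synthetic data, with two lightweight base models chosen only to keep it fast. It is not the pipeline used in this notebook, and StackingRegressor takes care of these steps internally.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption for illustration only).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

base_models = [
    DecisionTreeRegressor(max_depth=4, random_state=0),
    KNeighborsRegressor(n_neighbors=5),
]

# Out-of-fold predictions: every row is predicted by a model that did not see it.
meta_features = np.column_stack(
    [cross_val_predict(model, X, y, cv=5) for model in base_models]
)

# The meta model learns how to weight the base model predictions.
meta_model = LinearRegression().fit(meta_features, y)
print(meta_model.coef_)
```

The coefficients of the linear meta model show how much weight each base model receives, which is the interpretability benefit mentioned above.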

In [12]:
#importing necessary libraries to work on
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline

# Create transformers
imp = SimpleImputer(strategy='mean')
scaler = MinMaxScaler()

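# Define transformers for different columns (selected by positional index).
# tf imputes columns 2 and 5 of game_x (PlayTime and RatingAverage).
# Note: ColumnTransformer places the transformed columns first and the passthrough columns after them,
# so the indices given to tf2 refer to the reordered output of tf, not to the original column order.
tf = ColumnTransformer(transformers=[('imputer', imp, [2, 5])], remainder='passthrough')
tf2 = ColumnTransformer(transformers=[('scaler', scaler, [2, 4, 5, 6, 7])], remainder='passthrough')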

# Define the base models
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
svr_model = SVR(kernel='rbf', gamma='scale')
gb_model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)

# Define the meta model
meta_model = LinearRegression()

# Define the stacking regressor
stack_model = StackingRegressor(estimators=[('rf', rf_model), ('svr', svr_model), ('gb', gb_model)], final_estimator=meta_model)

steps = [('imputer', tf),
         ('scaler', tf2),  
         ('stk_model', stack_model)]

Pipe = Pipeline(steps)
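One detail worth keeping in mind when chaining two ColumnTransformer steps with positional indices: the output places the transformed columns first and the passthrough columns after them, so the column order changes between steps. A minimal sketch on a toy array (the values are made up so that each column is easy to recognise):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Toy matrix with 4 columns; the tens digit of each value encodes its original column.
X_toy = np.array([
    [0.0, 10.0, np.nan, 30.0],
    [1.0, 11.0, 22.0, 31.0],
    [2.0, 12.0, 24.0, 32.0],
])

# Impute only column 2 and pass the remaining columns through unchanged.
ct = ColumnTransformer(
    transformers=[('imputer', SimpleImputer(strategy='mean'), [2])],
    remainder='passthrough',
)
out = ct.fit_transform(X_toy)

# The imputed column now comes first, followed by the original columns 0, 1 and 3.
print(out)
```

Here the imputed column, originally at index 2, ends up at index 0, so a second transformer that refers to index 2 would act on what was originally column 1.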

Fit the pipeline on the training data using the fit method.

In [13]:
pipe = Pipe.fit(X_train, Y_train)

Additional evaluation was done to check which model or combination of models performs best.

In [None]:
#Using only Random Forest Regressor as the base model and Linear Regression as the meta model in the code above, the R2 score on the test set is 0.9634485277049024 (96.34%)
#Using only Gradient Boosting Regressor as the base model and Linear Regression as the meta model in the code above, the R2 score on the test set is 0.9630034165187327 (96.30%)
#Using only Support Vector Regression as the base model and Linear Regression as the meta model in the code above, the R2 score on the test set is 0.0005302180322895866 (0.05%)
#Using only a Linear Regression model in the code above, the R2 score on the test set is 0.6793918290602308 (67.93%)

### That is why a hybrid stacking model is used, with Random Forest, Support Vector, and Gradient Boosting as base models and Linear Regression as the meta model: it gives the best score of these configurations.

###### Sample code for the Random Forest Regressor and Linear Regression hybrid model is below; only the base model was changed.
#importing necessary libraries.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline

###### Create transformers
imp = SimpleImputer(strategy='mean')
scaler = MinMaxScaler()

###### Define transformers for different columns
tf = ColumnTransformer(transformers=[('imputer', imp, [2, 5])], remainder='passthrough')
tf2 = ColumnTransformer(transformers=[('scaler', scaler, [2, 4, 5, 6, 7])], remainder='passthrough')

###### Define the base models
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

###### Define the meta model
meta_model = LinearRegression()

###### Define the stacking regressor
stack_model = StackingRegressor(estimators=[('rf', rf_model)], final_estimator=meta_model)

steps = [('imputer', tf),
         ('scaler', tf2),  
         ('stk_model', stack_model)]

Pipe = Pipeline(steps)

##5. Model Evaluation

####Evaluate the R2 score on the test data using the score method.

In [14]:
pipe.score(X_test, Y_test) #For a regressor, score returns the coefficient of determination (R2) on the test set.

0.9654833747867442
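For regressors, the score method and sklearn.metrics.r2_score report the same quantity, which is why the same number appears again in the metrics further on. A minimal sketch on synthetic data (the data and the plain linear model are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Small synthetic regression problem (values are made up).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 3.0 * X.ravel() + rng.normal(0, 1, size=50)

model = LinearRegression().fit(X, y)

# score() on a regressor and r2_score() on its predictions give the same value.
score_value = model.score(X, y)
r2_value = r2_score(y, model.predict(X))
print(score_value, r2_value)
```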

####Making prediction on the test dataset

In [15]:
Y_predict = pipe.predict(X_test)

###Evaluating the Accuracy of the Model using Different Evaluation Metrics

R2, MAE, MSE, RMSE, and MAPE are used here as evaluation metrics for the stacking regression model above, to give different views of the accuracy of its predictions.

In [16]:
#importing evaluation metrics libraries.
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import numpy as np

# Use the trained model to make predictions on the test data
Y_predict = pipe.predict(X_test)

# Calculate R2 (R-squared). Represents the proportion of the variance in the dependent variable that is explained by the model.
r2 = r2_score(Y_test, Y_predict)
print(r2)

# Calculate MAE (Mean Absolute Error). Measures the average absolute difference between the predicted and actual values.
mae = mean_absolute_error(Y_test, Y_predict)
print(mae)

# Calculate MSE (Mean Squared Error). Similar to MAE, however it squares the difference between predicted and actual values before averaging.
mse = mean_squared_error(Y_test, Y_predict)
print(mse)

# Calculate RMSE (Root Mean Squared Error). Square root of MSE.
rmse = np.sqrt(mse)
print(rmse)

# Calculate MAPE (Mean Absolute Percentage Error). MAPE provides a measure of the percentage error between the predicted and actual values.
mape = np.mean(np.abs((Y_test - Y_predict) / Y_test)) * 100
print(mape)

0.9654833747867442
471.93525337925604
726845.5239893072
852.5523585031639
6.023569652494656


With an R2 of about 0.965 and a MAPE of about 6%, the predictions track the actual ranks closely on the test set. The MAE shows that a prediction is off by about 472 rank positions on average.
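To make the five metrics concrete, they can be computed by hand with NumPy on three made-up rank values. The numbers below are hypothetical and serve only as a worked example of the formulas.

```python
import numpy as np

# Hypothetical actual and predicted ranks.
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 180.0, 400.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))                  # average absolute error
mse = np.mean(errors ** 2)                     # average squared error
rmse = np.sqrt(mse)                            # back in the units of the target
mape = np.mean(np.abs(errors / y_true)) * 100  # average error as a percentage of the actual value
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mae, mse, rmse, mape, r2)
```

Because MSE squares the errors, the single error of 20 contributes four times as much as the error of 10, which is why RMSE is larger than MAE here.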

###Cross-Validation Evaluation Techniques

####K-Fold Cross Validation
K-fold cross-validation estimates the performance of the machine learning model by splitting the data into k equal parts, using k-1 parts to train the model and the remaining part to evaluate its performance. This process is repeated k times, each time using a different part as the evaluation set. This gives a more robust estimate of the model's performance than a single train/test split and makes overfitting easier to detect. A small spread among the k scores indicates that the model is stable and is less prone to overfitting.

In [17]:
from sklearn.model_selection import KFold, cross_val_score
kf = KFold()
cross_val_score(pipe, game_x, game_y, cv=kf )

array([0.97196506, 0.95549904, 0.97195195, 0.96868837, 0.97375158])
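How the folds are formed can be seen on a tiny example. The sketch below uses 10 stand-in observations instead of the game data and prints which positions are used for training and testing in each of the 5 folds.

```python
import numpy as np
from sklearn.model_selection import KFold

samples = np.arange(10)      # stand-in for 10 observations
kf_demo = KFold(n_splits=5)  # 5 is also the default number of splits

test_folds = []
for train_idx, test_idx in kf_demo.split(samples):
    print('train:', train_idx, 'test:', test_idx)
    test_folds.append(test_idx)

# Every observation is used for testing exactly once across the folds.
all_test = np.sort(np.concatenate(test_folds))
```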

####Leave-one-out Cross Validation
Leave-one-out cross-validation (LOOCV) trains the model on all but one observation and evaluates its performance on the left-out observation. This process is repeated for each observation in the dataset. LOOCV can be computationally expensive for large datasets, but it can provide a more reliable estimate of the model's performance and stability. Because each test set holds a single observation, R2 is undefined per fold, so an error-based scorer is used below. (Caution: it will take more than 3 hours of computation time.)

In [None]:
from sklearn.model_selection import LeaveOneOut, cross_val_score
loo = LeaveOneOut()
#R2 (the default scorer) is undefined for a single test sample, so each fold is scored by its absolute error instead.
cross_val_score(pipe, game_x, game_y, cv=loo, scoring='neg_mean_absolute_error')
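The leave-one-out procedure can be tried quickly on a small synthetic dataset before committing to the long run. In this sketch the data and the plain linear model are assumptions made to keep it fast, and each single-observation fold is scored by its absolute error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic dataset with 12 observations (values are made up).
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(12, 1))
y = 2.0 * X.ravel() + rng.normal(0, 0.5, size=12)

# One fit per observation; scikit-learn reports errors as negative values.
loo_scores = cross_val_score(
    LinearRegression(), X, y,
    cv=LeaveOneOut(), scoring='neg_mean_absolute_error',
)

# Flip the sign to obtain the mean absolute error over all left-out observations.
loo_mae = -loo_scores.mean()
print(len(loo_scores), loo_mae)
```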

####Shuffle Split Cross Validation
Shuffle-split cross-validation randomly shuffles the data and splits it into training and testing sets multiple times. This approach is mainly useful for datasets where the observations are not ordered by time, like the game dataset here.

In [18]:
from sklearn.model_selection import ShuffleSplit, cross_val_score
ss = ShuffleSplit()
cross_val_score(pipe, game_x, game_y, cv=ss)

array([0.97610679, 0.96571353, 0.98190533, 0.96435091, 0.96642889,
       0.9553103 , 0.97485069, 0.96001524, 0.98282628, 0.97840938])

The cross-validation results are quite stable (the range between the extreme values is small), which shows that the model performs consistently.
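The stability can be quantified by summarising the scores reported above. The sketch below copies the K-fold and shuffle-split scores from the outputs and computes their mean, standard deviation, and range.

```python
import numpy as np

# Scores copied from the K-fold and shuffle-split outputs above.
kfold_scores = np.array([0.97196506, 0.95549904, 0.97195195, 0.96868837, 0.97375158])
shuffle_scores = np.array([0.97610679, 0.96571353, 0.98190533, 0.96435091, 0.96642889,
                           0.9553103, 0.97485069, 0.96001524, 0.98282628, 0.97840938])

for name, scores in [('K-fold', kfold_scores), ('Shuffle split', shuffle_scores)]:
    spread = scores.max() - scores.min()
    print(name, round(scores.mean(), 4), round(scores.std(), 4), round(spread, 4))
```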

##6. Hyperparameter Tuning using GridSearch Cross Validation Method with Pipeline
GridSearchCV (Grid Search Cross Validation) is a method for tuning hyperparameters of a machine learning model to improve its performance. It is a systematic approach for analysing various combinations of hyperparameter values to identify the best performing model and its parameters.

Import the necessary library and specify the parameter grid that the grid search will evaluate to find the best parameters.
#####The param_grid below is computationally heavy and takes about 1 hour to run. To check quickly whether the code works, give a single value for each parameter; this reduces the computation time.

In [19]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'imputer__imputer__strategy': ['mean', 'median'],
    'stk_model__rf__n_estimators': [100, 200, 300],
    'stk_model__gb__learning_rate': [0.05, 0.1, 0.2],
    'stk_model__gb__n_estimators': [50, 100, 150],
}
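The cost of the search grows with the product of the list lengths in the grid. The sketch below counts the candidates with scikit-learn's ParameterGrid; the reduced grid (named param_grid_quick here for illustration) shows the single-value variant suggested above for a quick check.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the notebook.
param_grid_full = {
    'imputer__imputer__strategy': ['mean', 'median'],
    'stk_model__rf__n_estimators': [100, 200, 300],
    'stk_model__gb__learning_rate': [0.05, 0.1, 0.2],
    'stk_model__gb__n_estimators': [50, 100, 150],
}

# Hypothetical reduced grid with one value per parameter, for a fast smoke test.
param_grid_quick = {
    'imputer__imputer__strategy': ['mean'],
    'stk_model__rf__n_estimators': [100],
    'stk_model__gb__learning_rate': [0.1],
    'stk_model__gb__n_estimators': [100],
}

n_candidates = len(ParameterGrid(param_grid_full))  # 2 * 3 * 3 * 3 combinations
n_fits = n_candidates * 5                           # GridSearchCV uses 5-fold CV by default
n_quick = len(ParameterGrid(param_grid_quick))

print(n_candidates, n_fits, n_quick)
```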

Returns the parameter names of the pipeline, which are the valid keys for param_grid.

In [20]:
pipe.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'imputer', 'scaler', 'stk_model', 'imputer__n_jobs', 'imputer__remainder', 'imputer__sparse_threshold', 'imputer__transformer_weights', 'imputer__transformers', 'imputer__verbose', 'imputer__verbose_feature_names_out', 'imputer__imputer', 'imputer__imputer__add_indicator', 'imputer__imputer__copy', 'imputer__imputer__fill_value', 'imputer__imputer__keep_empty_features', 'imputer__imputer__missing_values', 'imputer__imputer__strategy', 'imputer__imputer__verbose', 'scaler__n_jobs', 'scaler__remainder', 'scaler__sparse_threshold', 'scaler__transformer_weights', 'scaler__transformers', 'scaler__verbose', 'scaler__verbose_feature_names_out', 'scaler__scaler', 'scaler__scaler__clip', 'scaler__scaler__copy', 'scaler__scaler__feature_range', 'stk_model__cv', 'stk_model__estimators', 'stk_model__final_estimator__copy_X', 'stk_model__final_estimator__fit_intercept', 'stk_model__final_estimator__n_jobs', 'stk_model__final_estimator__positive', 'stk_model_

In [21]:
search = GridSearchCV(pipe, param_grid, n_jobs=-1, verbose=12)

Fits the GridSearchCV object to perform cross-validation over the hyperparameter combinations above and selects the best set of hyperparameters. Note that the search is fitted on the full dataset (game_x, game_y), not only on the training split.

In [22]:
search = search.fit(game_x, game_y)

Fitting 5 folds for each of 54 candidates, totalling 270 fits


Returns the best score based on the GridSearch Cross Validation

In [23]:
search.best_score_ #shows the gridsearch best score using the best parameters for the predicting model.

0.9685903845909367

The result above is very close to the score obtained initially with the stacking regressor model, which shows that the model performs consistently.

Returns the best parameters found by the grid search; they are listed below. Running search = search.fit(game_x, game_y) takes around 1 hour.

#####'stk_model__gb__learning_rate': 0.1
#####'imputer__imputer__strategy': 'mean'
#####'stk_model__gb__n_estimators': 150
#####'stk_model__rf__n_estimators': 100

In [24]:
search.best_params_ #getting the best parameters analysed after running gridsearch code.

{'imputer__imputer__strategy': 'mean',
 'stk_model__gb__learning_rate': 0.1,
 'stk_model__gb__n_estimators': 150,
 'stk_model__rf__n_estimators': 100}

The data is read and cleaned again so that predictions can be generated with the model tuned by GridSearchCV.

In [25]:
game_test = pd.read_csv('War.csv')
game_test = game_test.drop(columns=['ID','Mechanics','Name','Domains','YearPublished'])

In [26]:
game_test['ComplexityAverage'] = game_test['ComplexityAverage'].str.replace(',', '').astype(float).round().astype(int) #Removes the commas, converts the values to float, rounds them, and casts the result to integer.

In [27]:
game_test['RatingAverage'] = pd.to_numeric(game_test['RatingAverage'].str.replace(',', '.'), errors='coerce') #Replaces comma separator with a period and then converts it to numeric data type. If any error occurs, it is coerced (converted) to NaN.

In [28]:
game_test.isnull().sum() #to check what columns have NaN values, and whether we are working with the correct number of columns or not.

MinPlayers             0
MaxPlayers             0
PlayTime              17
MinAge                 0
UsersRated             0
RatingAverage        229
BGGRank                0
ComplexityAverage      0
OwnedUsers             0
dtype: int64

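Predictions of the BGG rank are generated with the best estimator found by the grid search and stored in a result.csv file.

In [29]:
#Note: game_test still contains the BGGRank target column, whereas the pipeline was fitted on game_x without it.
#Dropping it first, e.g. game_test.drop(columns=['BGGRank']), keeps the positional column indices aligned with training.
pred = search.predict(game_test)
print(pred) #Print the predictions to confirm that the code is working correctly.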

[ 7159.37549058 19045.51623964 17537.42660589 ...  6484.48505239
  1794.42297608 10325.59605904]


In [30]:
np.savetxt("result.csv", pred, delimiter=",") #saving results in a result.csv file

##7. Business Deployment

The trained machine learning model, which predicts the BoardGameGeek (BGG) ranking of games based on various features such as MinPlayers, MaxPlayers, Playtime, UsersRated, ComplexityAverage and so on, can be integrated into a software application. The purpose of using the machine learning model is to analyze the BGG ranking of future games before development.

By using the integrated machine learning model, the company can judge whether developing the new game is likely to be profitable. Because BGGRank is a position in a ranking, a lower predicted value is better: for example, a game predicted near the top of the ranking (the best rank in this dataset is 140) is more likely to thrive in the market and generate a good profit than a game whose predicted rank is in the tens of thousands.
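One common way to make a fitted scikit-learn pipeline available to an application is to save it to disk with joblib and load it where predictions are needed. The sketch below is an assumption about how this could be done, not part of the original notebook: it uses a toy pipeline and a hypothetical file name, and in practice the saved object would be pipe or search.best_estimator_.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for the fitted pipeline (synthetic data, for illustration only).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(30, 3))
y_demo = X_demo @ np.array([1.0, 2.0, 3.0])
demo_pipe = Pipeline([('scaler', MinMaxScaler()), ('model', LinearRegression())])
demo_pipe.fit(X_demo, y_demo)

# Save the fitted pipeline, then load it back as an application would.
model_path = os.path.join(tempfile.mkdtemp(), 'bgg_rank_model.joblib')  # hypothetical file name
joblib.dump(demo_pipe, model_path)
loaded_pipe = joblib.load(model_path)

# The loaded pipeline reproduces the predictions of the original one.
same_predictions = np.allclose(demo_pipe.predict(X_demo), loaded_pipe.predict(X_demo))
print(same_predictions)
```

Saving the whole pipeline, rather than only the final model, keeps the imputation and scaling steps together with the regressor, so new game data goes through the same preprocessing as the training data.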