## Game Recommender

by: Jerry Phillips

## Background
This notebook was written for the UMass capstone project. It attempts to provide game recommendations from a generated dataset hosted on Kaggle.

# 1. Install and import the required libraries

In [2]:
# Install the required libraries
!pip install -U scikit-learn
!pip install kaggle



In [12]:
import os
import zipfile
import json

# Libraries for data processing
import pandas as pd
import numpy as np
from collections import Counter
from sklearn.preprocessing import MinMaxScaler

# Libraries for building the recommendation system model
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics.pairwise import cosine_similarity

# Libraries for evaluating the machine learning model
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

# 2. Prepare the Dataset

### 2.1 Set the Kaggle username and API key

In [13]:
# Set the Kaggle credentials as environment variables.
# Never commit a real API key; replace the placeholder with your own value.

os.environ['KAGGLE_USERNAME'] = 'jerseid'
os.environ['KAGGLE_KEY'] = '<YOUR_KAGGLE_API_KEY>'
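
Hard-coding an API key in a notebook exposes it to anyone who can read the file. As a hedged alternative, the sketch below loads the credentials from a JSON file shaped like Kaggle's `kaggle.json` token file (with `username` and `key` entries) and exports them as environment variables. The helper name `load_kaggle_credentials` and the file location are illustrative assumptions, not part of the original notebook.

```python
import json
import os
from pathlib import Path


def load_kaggle_credentials(path):
    """Read a kaggle.json-style file and export its values as environment variables.

    Hypothetical helper: assumes the file holds a JSON object with
    "username" and "key" entries.
    """
    creds = json.loads(Path(path).expanduser().read_text())
    os.environ["KAGGLE_USERNAME"] = creds["username"]
    os.environ["KAGGLE_KEY"] = creds["key"]
    # Return only the username so the key is never echoed in cell output
    return creds["username"]
```

A call such as `load_kaggle_credentials("~/.kaggle/kaggle.json")` would then replace the two assignments above, assuming the token file is stored at that path.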

### 2.2 Download and prepare the dataset

In [5]:
# Download the dataset archive with the Kaggle CLI
!kaggle datasets download jahnavipaliwal/video-game-reviews-and-ratings

Dataset URL: https://www.kaggle.com/datasets/jahnavipaliwal/video-game-reviews-and-ratings
License(s): apache-2.0
Downloading video-game-reviews-and-ratings.zip to /Users/jerryphillips/anaconda_projects/f0bafc16-4759-43a0-9709-3792efe4f201
 91%|██████████████████████████████████▌   | 1.00M/1.10M [00:00<00:00, 2.49MB/s]
100%|██████████████████████████████████████| 1.10M/1.10M [00:00<00:00, 2.67MB/s]


In [9]:
# Extract the zip file into the current working directory
import os
import zipfile

# Path to the downloaded archive
files = "/Users/jerryphillips/Downloads/archive.zip"

# The current working directory should be writable
extract_path = os.getcwd()

# The 'with' statement closes the archive automatically
with zipfile.ZipFile(files, 'r') as zip_ref:
    zip_ref.extractall(extract_path)
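
The extraction step can also report which files it produced, which helps when the CSV name inside the archive is not known in advance. The following is a minimal sketch under that assumption; `extract_csv_members` is a hypothetical helper name, and it extracts only members whose names end in `.csv`.

```python
import zipfile
from pathlib import Path


def extract_csv_members(zip_path, dest_dir):
    """Extract only the .csv members of an archive and return their paths."""
    dest = Path(dest_dir)
    with zipfile.ZipFile(zip_path, "r") as archive:
        # Keep only members that look like CSV files
        csv_names = [n for n in archive.namelist() if n.lower().endswith(".csv")]
        archive.extractall(dest, members=csv_names)
    return [dest / name for name in csv_names]
```

The returned paths could then be passed to `pd.read_csv` directly instead of the archive path.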

# 3. Data Understanding

### 3.1 Read data with pandas DataFrame

In [14]:
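# 'files' points at the zip archive; pandas infers zip compression from the
# extension and reads the CSV directly, provided the archive holds a single file
df = pd.read_csv(files)
df.head()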

Unnamed: 0,Game Title,User Rating,Age Group Targeted,Price,Platform,Requires Special Device,Developer,Publisher,Release Year,Genre,Multiplayer,Game Length (Hours),Graphics Quality,Soundtrack Quality,Story Quality,User Review Text,Game Mode,Min Number of Players
0,Grand Theft Auto V,36.4,All Ages,41.41,PC,No,Game Freak,Innersloth,2015,Adventure,No,55.3,Medium,Average,Poor,"Solid game, but too many bugs.",Offline,1
1,The Sims 4,38.3,Adults,57.56,PC,No,Nintendo,Electronic Arts,2015,Shooter,Yes,34.6,Low,Poor,Poor,"Solid game, but too many bugs.",Offline,3
2,Minecraft,26.8,Teens,44.93,PC,Yes,Bungie,Capcom,2012,Adventure,Yes,13.9,Low,Good,Average,"Great game, but the graphics could be better.",Offline,5
3,Bioshock Infinite,38.4,All Ages,48.29,Mobile,Yes,Game Freak,Nintendo,2015,Sports,No,41.9,Medium,Good,Excellent,"Solid game, but the graphics could be better.",Online,4
4,Half-Life: Alyx,30.1,Adults,55.49,PlayStation,Yes,Game Freak,Epic Games,2022,RPG,Yes,13.2,High,Poor,Good,"Great game, but too many bugs.",Offline,1


In [15]:
df.shape

(47774, 18)

In [16]:
# Check dataset information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47774 entries, 0 to 47773
Data columns (total 18 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Game Title               47774 non-null  object 
 1   User Rating              47774 non-null  float64
 2   Age Group Targeted       47774 non-null  object 
 3   Price                    47774 non-null  float64
 4   Platform                 47774 non-null  object 
 5   Requires Special Device  47774 non-null  object 
 6   Developer                47774 non-null  object 
 7   Publisher                47774 non-null  object 
 8   Release Year             47774 non-null  int64  
 9   Genre                    47774 non-null  object 
 10  Multiplayer              47774 non-null  object 
 11  Game Length (Hours)      47774 non-null  float64
 12  Graphics Quality         47774 non-null  object 
 13  Soundtrack Quality       47774 non-null  object 
 14  Story Quality         

In [17]:
df.isna().sum()

Game Title                 0
User Rating                0
Age Group Targeted         0
Price                      0
Platform                   0
Requires Special Device    0
Developer                  0
Publisher                  0
Release Year               0
Genre                      0
Multiplayer                0
Game Length (Hours)        0
Graphics Quality           0
Soundtrack Quality         0
Story Quality              0
User Review Text           0
Game Mode                  0
Min Number of Players      0
dtype: int64

In [18]:
# Describe dataset column
df.describe()

Unnamed: 0,User Rating,Price,Release Year,Game Length (Hours),Min Number of Players
count,47774.0,47774.0,47774.0,47774.0,47774.0
mean,29.719329,39.951371,2016.480952,32.481672,5.116758
std,7.550131,11.520342,4.027276,15.872508,2.769521
min,10.1,19.99,2010.0,5.0,1.0
25%,24.3,29.99,2013.0,18.8,3.0
50%,29.7,39.845,2016.0,32.5,5.0
75%,35.1,49.9575,2020.0,46.3,7.0
max,49.5,59.99,2023.0,60.0,10.0


# 4. Data Preparation

### 4.1 Drop unused columns

In [19]:
df.drop(['Requires Special Device', 'Publisher'], axis=1, inplace=True)
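
An in-place `drop` raises a `KeyError` if the cell is run a second time, because the columns are already gone. As a hedged sketch, the variant below returns a new frame and passes `errors="ignore"` so it can be re-run safely. The helper name `drop_unused` and the constant `UNUSED_COLUMNS` are assumptions introduced for illustration.

```python
import pandas as pd

# Columns the notebook treats as unused (taken from the cell above)
UNUSED_COLUMNS = ["Requires Special Device", "Publisher"]


def drop_unused(frame, columns=UNUSED_COLUMNS):
    """Return a copy of the frame without the listed columns.

    errors="ignore" skips columns that are already absent, so the
    function can be applied repeatedly without raising KeyError.
    """
    return frame.drop(columns=columns, errors="ignore")
```

With this helper, `df = drop_unused(df)` gives the same result whether it runs once or several times.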

### 4.2 Clean each column of the data

#### 4.2.1 Loop through columns for missing values

In [44]:
# Report, column by column, whether any values are missing
cols = df.columns.tolist()
for col in cols:
    if df[col].isna().sum() == 0:
        print("There is no empty data in the %s column" % col)
    else:
        print("Missing value detected in %s column" % col)

There is no empty data in the Game Title column
There is no empty data in the User Rating column
There is no empty data in the Age Group Targeted column
There is no empty data in the Price column
There is no empty data in the Platform column
There is no empty data in the Developer column
There is no empty data in the Release Year column
There is no empty data in the Genre column
There is no empty data in the Multiplayer column
There is no empty data in the Game Length (Hours) column
There is no empty data in the Graphics Quality column
There is no empty data in the Soundtrack Quality column
There is no empty data in the Story Quality column
There is no empty data in the User Review Text column
There is no empty data in the Game Mode column
There is no empty data in the Min Number of Players column
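
The same check can be expressed without an explicit loop. The sketch below returns only the columns that contain missing values, together with their counts, so an empty result means the data is complete. `columns_with_missing` is a hypothetical helper name.

```python
import pandas as pd


def columns_with_missing(frame):
    """Return a {column: missing_count} dict for columns with at least one NaN."""
    counts = frame.isna().sum()
    return counts[counts > 0].to_dict()
```

On the dataset above, this would be expected to return an empty dictionary, since every column reports zero missing values.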


#### 4.2.2 Platform Column

In [23]:
# Use collections.Counter to count how often each platform appears
platform_counter = Counter(df['Platform'])
platform_counter

Counter({'PlayStation': 9633,
         'PC': 9599,
         'Nintendo Switch': 9596,
         'Mobile': 9589,
         'Xbox': 9357})

In [24]:
# Check unique element
df['Platform'].unique()

array(['PC', 'Mobile', 'PlayStation', 'Xbox', 'Nintendo Switch'],
      dtype=object)

#### 4.2.3 Genre Column

In [25]:
# Check missing value
df['Genre'].isna().sum()

np.int64(0)

In [26]:
# Check unique element
df['Genre'].unique()

array(['Adventure', 'Shooter', 'Sports', 'RPG', 'Simulation', 'Strategy',
       'Fighting', 'Action', 'Party', 'Puzzle'], dtype=object)

In [27]:
# check dataset information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47774 entries, 0 to 47773
Data columns (total 16 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Game Title             47774 non-null  object 
 1   User Rating            47774 non-null  float64
 2   Age Group Targeted     47774 non-null  object 
 3   Price                  47774 non-null  float64
 4   Platform               47774 non-null  object 
 5   Developer              47774 non-null  object 
 6   Release Year           47774 non-null  int64  
 7   Genre                  47774 non-null  object 
 8   Multiplayer            47774 non-null  object 
 9   Game Length (Hours)    47774 non-null  float64
 10  Graphics Quality       47774 non-null  object 
 11  Soundtrack Quality     47774 non-null  object 
 12  Story Quality          47774 non-null  object 
 13  User Review Text       47774 non-null  object 
 14  Game Mode              47774 non-null  object 
 15  Mi

#### 4.2.4 Developer Column

In [30]:
# Check unique element
df['Developer'].unique()

array(['Game Freak', 'Nintendo', 'Bungie', 'Capcom', 'Epic Games',
       'CD Projekt Red', 'EA Sports', 'Rockstar Games', 'Innersloth',
       'Valve'], dtype=object)

In [31]:
# Check for rows whose developer is recorded as 'Unknown'
df[df['Developer'] == 'Unknown']

Unnamed: 0,Game Title,User Rating,Age Group Targeted,Price,Platform,Developer,Release Year,Genre,Multiplayer,Game Length (Hours),Graphics Quality,Soundtrack Quality,Story Quality,User Review Text,Game Mode,Min Number of Players


#### 4.2.5 Release Year 

In [45]:
# Convert the column to string so that it is treated as categorical
df['Release Year'] = df['Release Year'].astype('str')
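
If the goal is to treat the release year as a categorical feature, pandas also provides a dedicated `category` dtype. The following is a minimal sketch of that alternative, not the approach the notebook takes; `as_categorical` is a hypothetical helper name.

```python
import pandas as pd


def as_categorical(frame, column="Release Year"):
    """Return a copy of the frame with one column stored as a pandas categorical."""
    result = frame.copy()
    result[column] = result[column].astype("category")
    return result
```

A categorical column keeps the original values as its categories, so the years remain available as numbers if they are needed later.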

#### 4.2.6 User Rating

In [46]:
# Shift the decimal point by dividing the rating by 10
# Note: run this cell only once; each re-run divides the column by 10 again
df['User Rating'] = df['User Rating'] / 10
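
Dividing a column in place is not idempotent: re-running the cell divides the already scaled values again. One hedged way around this is to leave the raw column untouched and write the scaled values to a separate column, as sketched below. The helper name `add_scaled_rating` and the target column name `User Rating (scaled)` are assumptions introduced for illustration.

```python
import pandas as pd


def add_scaled_rating(frame, source="User Rating",
                      target="User Rating (scaled)", divisor=10):
    """Return a copy with a scaled rating column derived from the raw one.

    The source column is never modified, so applying the function
    repeatedly always yields the same scaled values.
    """
    result = frame.copy()
    result[target] = result[source] / divisor
    return result
```

Because the scaled column is always recomputed from the raw ratings, the result does not depend on how many times the cell has been executed.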

In [47]:
df.head()

Unnamed: 0,Game Title,User Rating,Age Group Targeted,Price,Platform,Developer,Release Year,Genre,Multiplayer,Game Length (Hours),Graphics Quality,Soundtrack Quality,Story Quality,User Review Text,Game Mode,Min Number of Players
0,Grand Theft Auto V,0.364,All Ages,41.41,PC,Game Freak,2015,Adventure,No,55.3,Medium,Average,Poor,"Solid game, but too many bugs.",Offline,1
1,The Sims 4,0.383,Adults,57.56,PC,Nintendo,2015,Shooter,Yes,34.6,Low,Poor,Poor,"Solid game, but too many bugs.",Offline,3
2,Minecraft,0.268,Teens,44.93,PC,Bungie,2012,Adventure,Yes,13.9,Low,Good,Average,"Great game, but the graphics could be better.",Offline,5
3,Bioshock Infinite,0.384,All Ages,48.29,Mobile,Game Freak,2015,Sports,No,41.9,Medium,Good,Excellent,"Solid game, but the graphics could be better.",Online,4
4,Half-Life: Alyx,0.301,Adults,55.49,PlayStation,Game Freak,2022,RPG,Yes,13.2,High,Poor,Good,"Great game, but too many bugs.",Offline,1


In [36]:
# Describe dataset column
df.describe()

Unnamed: 0,User Rating,Price,Release Year,Game Length (Hours),Min Number of Players
count,47774.0,47774.0,47774.0,47774.0,47774.0
mean,2.971933,39.951371,2016.480952,32.481672,5.116758
std,0.755013,11.520342,4.027276,15.872508,2.769521
min,1.01,19.99,2010.0,5.0,1.0
25%,2.43,29.99,2013.0,18.8,3.0
50%,2.97,39.845,2016.0,32.5,5.0
75%,3.51,49.9575,2020.0,46.3,7.0
max,4.95,59.99,2023.0,60.0,10.0


In [48]:
# check missing value of the whole data
df.isna().sum()

Game Title               0
User Rating              0
Age Group Targeted       0
Price                    0
Platform                 0
Developer                0
Release Year             0
Genre                    0
Multiplayer              0
Game Length (Hours)      0
Graphics Quality         0
Soundtrack Quality       0
Story Quality            0
User Review Text         0
Game Mode                0
Min Number of Players    0
dtype: int64

In [49]:
# Confirm that the dataset contains no missing values
if df.isna().sum().sum() == 0:
    print('Dataset cleaned')
else:
    print('Missing value detected')

Dataset cleaned
