# Building a Comprehensive Video Game Recommendation System

## Introduction
In the rapidly growing video game industry, players often face the challenge of discovering new games that match their interests and preferences. With thousands of games released across various platforms and genres, finding the next game to play can be overwhelming. This project aims to address this challenge by developing a comprehensive video game recommendation system.

Leveraging the `Video Game Sales with Ratings` dataset from Kaggle, our objective is to create a recommender system that suggests video games based on user preferences and game similarities. The dataset includes key information about video games such as titles, platforms, release years, genres, sales figures, and ratings, providing a rich source of data for building our models.

We will explore multiple recommendation approaches to ensure a robust and versatile system:
1. **Cosine Similarity-Based Recommender:** A content-based approach that measures the similarity between games based on their attributes.
2. **Simple Recommender:** A non-personalized recommendation method that suggests popular games based on sales and ratings.
3. **Content-Based Recommender:** Utilizes text-based features like game descriptions to find similar games.
4. **Collaborative Filtering:** A personalized recommendation method that leverages user interaction data to suggest games.
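
To make the first approach concrete, here is a minimal sketch of cosine similarity between games represented as numeric attribute vectors. The function name `cosine_similarity_matrix`, the game names, and the feature values are all hypothetical and are not taken from the dataset; this only illustrates the idea and is not the model built later in the project.

```python
import numpy as np

def cosine_similarity_matrix(features: np.ndarray) -> np.ndarray:
    """Return the pairwise cosine similarity between the rows of `features`."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    normalized = features / np.where(norms == 0, 1.0, norms)
    return normalized @ normalized.T

# Hypothetical toy data: rows are games, columns are numeric attributes.
toy_names = ["Game A", "Game B", "Game C"]
toy_features = np.array([
    [1.0, 0.0, 0.8],
    [1.0, 0.0, 0.7],
    [0.0, 1.0, 0.2],
])

similarity = cosine_similarity_matrix(toy_features)
# Rank the other games by similarity to "Game A" (row 0), most similar first.
ranking = [toy_names[i] for i in np.argsort(-similarity[0]) if i != 0]
print(ranking)
```

Games whose attribute vectors point in nearly the same direction score close to 1, so "Game B" ranks ahead of "Game C" for "Game A" in this toy example.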

The project will be structured as follows:
1. **Dataset Overview:** Introducing the dataset and its features.
2. **Exploratory Data Analysis (EDA):** Performing a detailed analysis to understand data distributions, trends, and patterns.
3. **Data Cleaning and Preprocessing:** Preparing the data by handling missing values, outliers, and encoding categorical variables.
4. **Feature Selection:** Identifying the most relevant features for building effective recommendation models.
5. **Model Training:** Implementing and training multiple recommendation models.
6. **Model Evaluation:** Evaluating the performance of each model using appropriate metrics.
7. **Comparison and Final Model Selection:** Comparing the models and selecting the best performing one.

Through this comprehensive approach, we aim to deliver a recommendation system that enhances the gaming experience by providing personalized and relevant game suggestions.

## Dataset Overview
This section introduces the `Video_Games_Sales_as_at_22_Dec_2016.csv` dataset, which covers video game sales across platforms up to December 22, 2016. It includes game titles, platforms, release years, genres, regional sales figures, and critic and user ratings. Below, we load the dataset, display its structure, and summarize its main characteristics as a foundation for the analysis and model building that follow.

In [None]:
# Importing necessary libraries
import warnings

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import numpy as np

warnings.filterwarnings("ignore")

In [None]:
# Loading the dataset
file_path = '../data/Video_Games_Sales_as_at_22_Dec_2016.csv'
df = pd.read_csv(file_path)

# Display the first few rows of the dataset
print("First few rows of the dataset:")
display(df.head())

# Display the summary of the dataset
print("\nDataset summary:")
df.info()  # info() prints its report and returns None, so wrapping it in display() only adds a stray "None"

# Display basic statistics for numerical columns
print("\nBasic statistics for numerical columns:")
display(df.describe())

# Display the list of columns and their descriptions
print("\nList of columns:")
columns = {
    'Name': 'Name of the video game',
    'Platform': 'Platform of the video game release (e.g., PS4, Xbox One, PC)',
    'Year_of_Release': 'Year of release of the video game',
    'Genre': 'Genre of the video game (e.g., Action, Sports, RPG)',
    'Publisher': 'Publisher of the video game',
    'NA_Sales': 'Sales in North America (in millions)',
    'EU_Sales': 'Sales in Europe (in millions)',
    'JP_Sales': 'Sales in Japan (in millions)',
    'Other_Sales': 'Sales in other regions (in millions)',
    'Global_Sales': 'Total worldwide sales (in millions)',
    'Critic_Score': 'Aggregate score compiled by Metacritic staff (0-100)',
    'Critic_Count': 'Number of critic reviews counted towards the Critic Score',
    'User_Score': 'Score by Metacritic’s subscribers (0-10)',
    'User_Count': 'Number of user reviews counted towards the User Score',
    'Developer': 'Developer of the video game',
    'Rating': 'ESRB rating (e.g., E for Everyone, M for Mature)',
}
for col, desc in columns.items():
    print(f"{col}: {desc}")

# Display the number of missing values in each column
print("\nNumber of missing values in each column:")
display(df.isnull().sum())

**Summary of Dataset Overview**

In this section, we have loaded the `Video_Games_Sales_as_at_22_Dec_2016.csv` dataset and provided an overview of its structure and contents. We displayed the first few rows to get an initial glimpse of the data, summarized the dataset's attributes, and highlighted the key features. Additionally, we listed the columns with their descriptions and identified missing values in the dataset. This comprehensive overview sets the stage for deeper exploratory data analysis and subsequent steps in building our recommendation system.
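
The raw missing-value counts can be easier to read as percentages. The sketch below shows one way to do this; `missing_value_report` is a hypothetical helper, and the toy frame stands in for the real dataset.

```python
import numpy as np
import pandas as pd

def missing_value_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Summarize missing values per column as counts and percentages."""
    counts = frame.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    # Keep only columns that actually have gaps, worst first.
    return report[report["missing"] > 0].sort_values("missing", ascending=False)

# Hypothetical toy frame standing in for the real dataset.
toy = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C", "Game D"],
    "Critic_Score": [80.0, np.nan, np.nan, 65.0],
    "Rating": ["E", None, "T", "M"],
})
report = missing_value_report(toy)
print(report)
```

On the real data, the same helper could be called as `missing_value_report(df)`.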

## Exploratory Data Analysis (EDA)
The Exploratory Data Analysis (EDA) section aims to explore the `Video_Games_Sales_as_at_22_Dec_2016.csv` dataset in depth to understand its structure, distributions, and relationships between features. This step is crucial for uncovering insights and patterns that will guide the subsequent data preprocessing and model-building phases. We will use a variety of statistical summaries and visualizations to examine the distributions of sales figures, genre popularity, platform trends, and ratings. This comprehensive analysis will help identify any anomalies, trends, and key characteristics of the data.

In [None]:
# Setting up visual styles
sns.set(style="whitegrid")
plt.style.use('fivethirtyeight')

# Displaying the first few rows of the dataset again for reference
print("First few rows of the dataset:")
display(df.head())

# 1. Distribution of Global Sales
plt.figure(figsize=(10, 6))
sns.histplot(df['Global_Sales'], kde=True, bins=30)
plt.title('Distribution of Global Sales')
plt.xlabel('Global Sales (in millions)')
plt.ylabel('Frequency')
plt.show()

# 2. Sales by Genre
plt.figure(figsize=(12, 8))
sns.boxplot(x='Genre', y='Global_Sales', data=df)
plt.xticks(rotation=90)
plt.title('Global Sales by Genre')
plt.xlabel('Genre')
plt.ylabel('Global Sales (in millions)')
plt.show()

# 3. Sales by Platform
plt.figure(figsize=(14, 8))
platform_sales = df.groupby('Platform')['Global_Sales'].sum().sort_values(ascending=False)
sns.barplot(x=platform_sales.index, y=platform_sales.values, palette='viridis')
plt.title('Total Global Sales by Platform')
plt.xlabel('Platform')
plt.ylabel('Total Global Sales (in millions)')
plt.xticks(rotation=90)
plt.show()

# 4. Sales by Year of Release
plt.figure(figsize=(14, 8))
year_sales = df.groupby('Year_of_Release')['Global_Sales'].sum().sort_index()
sns.lineplot(x=year_sales.index, y=year_sales.values)
plt.title('Total Global Sales by Year of Release')
plt.xlabel('Year of Release')
plt.ylabel('Total Global Sales (in millions)')
plt.xticks(rotation=90)
plt.show()

# 5. Distribution of Critic Scores
plt.figure(figsize=(10, 6))
sns.histplot(df['Critic_Score'].dropna(), kde=True, bins=30)
plt.title('Distribution of Critic Scores')
plt.xlabel('Critic Score')
plt.ylabel('Frequency')
plt.show()

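# 6. Distribution of User Scores
# User_Score is stored as text and contains 'tbd' placeholders, so coerce it to numeric before plotting
plt.figure(figsize=(10, 6))
sns.histplot(pd.to_numeric(df['User_Score'], errors='coerce').dropna(), kde=True, bins=30)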
plt.title('Distribution of User Scores')
plt.xlabel('User Score')
plt.ylabel('Frequency')
plt.show()

# 7. Correlation Heatmap (excluding non-numeric columns)
plt.figure(figsize=(12, 8))
numeric_df = df.select_dtypes(include='number')  # User_Score is still text here (it contains 'tbd'), so it is excluded
correlation_matrix = numeric_df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap of Features')
plt.show()

# 8. Distribution of Genres
plt.figure(figsize=(12, 8))
sns.countplot(y='Genre', data=df, order=df['Genre'].value_counts().index, palette='viridis')
plt.title('Distribution of Genres')
plt.xlabel('Count')
plt.ylabel('Genre')
plt.show()

# 9. Distribution of Platforms
plt.figure(figsize=(14, 8))
sns.countplot(y='Platform', data=df, order=df['Platform'].value_counts().index, palette='viridis')
plt.title('Distribution of Platforms')
plt.xlabel('Count')
plt.ylabel('Platform')
plt.show()

# 10. Distribution of Ratings
plt.figure(figsize=(12, 8))
sns.countplot(y='Rating', data=df, order=df['Rating'].value_counts().index, palette='viridis')
plt.title('Distribution of Ratings')
plt.xlabel('Count')
plt.ylabel('Rating')
plt.show()


**Summary of Exploratory Data Analysis (EDA)**

In this EDA section, we explored the `Video_Games_Sales_as_at_22_Dec_2016.csv` dataset through various visualizations and statistical summaries. We analyzed the distribution of global sales, examined sales trends across different genres, platforms, and years of release, and visualized the distributions of critic and user scores. Additionally, we created a correlation heatmap to identify relationships between numerical features and explored the distributions of genres, platforms, and ratings.

We observed that there is a scarcity of data for certain platforms such as DC and certain ratings such as 'K-A', 'AO', 'EC', and 'RP'. These insights provide a deeper understanding of the dataset and will inform our data preprocessing and feature selection strategies in the subsequent steps.
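
Sparse categories like these can also be listed programmatically instead of being read off the count plots. The following is a minimal sketch; `rare_categories`, the `min_count` threshold, and the toy ratings are assumptions chosen for illustration.

```python
import pandas as pd

def rare_categories(series: pd.Series, min_count: int) -> pd.Series:
    """Return the categories that appear fewer than `min_count` times."""
    counts = series.value_counts()
    return counts[counts < min_count]

# Hypothetical toy ratings; the threshold of 3 is arbitrary.
toy_ratings = pd.Series(
    ["E", "E", "E", "T", "T", "T", "M", "M", "M", "AO", "EC", "EC"]
)
rare = rare_categories(toy_ratings, min_count=3)
print(rare)
```

Applied to the real data, this could look like `rare_categories(df['Rating'], min_count=10)`, with the threshold chosen to suit the analysis.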

## Data Cleaning and Preprocessing
The Data Cleaning and Preprocessing section focuses on preparing the dataset for modeling by handling missing values, creating new features, and transforming the data. This involves removing records with missing critical data, imputing missing values for scores, converting categorical features to dummy variables, and standardizing numerical data. These steps ensure that the dataset is clean, consistent, and suitable for building effective recommendation models.


In [None]:
# Display the initial summary of the dataset
print("Initial dataset summary:")
df.info()  # info() prints directly and returns None

In [None]:
# 1. Remove records with missing data in 'Name', 'Genre', and 'Rating'
# .copy() avoids a SettingWithCopyWarning when columns are assigned later
df = df.dropna(subset=['Name', 'Genre', 'Rating']).copy()
print("\nDataset summary after removing records with missing 'Name', 'Genre', and 'Rating':")
df.info()

In [None]:
# 2. Create additional features for User_Score and Critic_Score and impute missing values

# Convert User_Score to numeric; 'tbd' (and any other non-numeric text) becomes NaN
df['User_Score'] = pd.to_numeric(df['User_Score'], errors='coerce')

# Group the records by Genre, then aggregate them calculating the average of both Critic Score and User Score
df_grp_by_genre = df[['Genre', 'Critic_Score', 'User_Score']].groupby('Genre', as_index=False)
df_score_mean = df_grp_by_genre.agg(Ave_Critic_Score = ('Critic_Score', 'mean'), Ave_User_Score = ('User_Score', 'mean'))

# Merge the average scores with the main dataframe
df = df.merge(df_score_mean, on='Genre')
df

In [None]:
# 3. Impute missing values by calculating the mean within each genre
df['Critic_Score_Imputed'] = np.where(df['Critic_Score'].isna(), df['Ave_Critic_Score'], df['Critic_Score'])
df['User_Score_Imputed'] = np.where(df['User_Score'].isna(), df['Ave_User_Score'], df['User_Score'])

print("\nSummary statistics for User_Score and User_Score_Imputed:")
display(df[['User_Score', 'User_Score_Imputed']].describe())

print("\nSummary statistics for Critic_Score and Critic_Score_Imputed:")
display(df[['Critic_Score', 'Critic_Score_Imputed']].describe())

In [None]:
# 4. Drop the original and genre-average score columns, keeping only the imputed versions
final_df = df.drop(columns=['User_Score', 'Critic_Score', 'Ave_Critic_Score', 'Ave_User_Score'])
final_df = final_df.reset_index(drop=True)
final_df = final_df.rename(columns={'Critic_Score_Imputed':'Critic_Score', 'User_Score_Imputed':'User_Score'})

# 5. Filter out only required columns
final_df = final_df[['Name', 'Platform', 'Genre', 'Rating', 'Critic_Score', 'User_Score']]
final_df.info()

In [None]:
# 6. Analyze the data distribution for `Critic_Score` and `User_Score`

# Distribution of Critic Scores
plt.figure(figsize=(10, 6))
sns.histplot(final_df['Critic_Score'].dropna(), kde=True, bins=30)
plt.title('Distribution of Critic Scores')
plt.xlabel('Critic Score')
plt.ylabel('Frequency')
plt.show()

# Distribution of User Scores
plt.figure(figsize=(10, 6))
sns.histplot(final_df['User_Score'].dropna(), kde=True, bins=30)
plt.title('Distribution of User Scores')
plt.xlabel('User Score')
plt.ylabel('Frequency')
plt.show()

# User Scores vs. Critic Scores (scatter plot with a fitted regression line)
plt.figure(figsize=(10, 10))
ax = sns.regplot(x=final_df['User_Score'], y=final_df['Critic_Score'], line_kws={"color": "black"}, scatter_kws={'s': 4})
ax.set(xlabel="User Score", ylabel="Critic Score", title="User Scores vs. Critic Scores")
plt.show()

In [None]:
# 7. Converting Categorical Features to Dummy Indicators
# Select the text-valued columns, excluding the game title by name rather than by position
categorical_features = [name for name in final_df.columns if final_df[name].dtype == 'O' and name != 'Name']

df_preprocessed = pd.get_dummies(data=final_df, columns=categorical_features)
df_preprocessed.head(10)

In [26]:
# 8. Standardizing the features (scores and dummy indicators alike)
features = df_preprocessed.drop(columns=['Name'])
scale = StandardScaler()
scaled_features = scale.fit_transform(features)
scaled_features = pd.DataFrame(scaled_features, columns=features.columns)
scaled_features.head(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9950 entries, 0 to 9949
Data columns (total 39 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Critic_Score        9950 non-null   float64
 1   User_Score          9950 non-null   float64
 2   Platform_3DS        9950 non-null   float64
 3   Platform_DC         9950 non-null   float64
 4   Platform_DS         9950 non-null   float64
 5   Platform_GBA        9950 non-null   float64
 6   Platform_GC         9950 non-null   float64
 7   Platform_PC         9950 non-null   float64
 8   Platform_PS         9950 non-null   float64
 9   Platform_PS2        9950 non-null   float64
 10  Platform_PS3        9950 non-null   float64
 11  Platform_PS4        9950 non-null   float64
 12  Platform_PSP        9950 non-null   float64
 13  Platform_PSV        9950 non-null   float64
 14  Platform_Wii        9950 non-null   float64
 15  Platform_WiiU       9950 non-null   float64
 16  Platfo

**Summary of Data Cleaning and Preprocessing**

In the Data Cleaning and Preprocessing section, we performed several crucial steps to prepare the dataset for modeling. We removed records with missing data in the `Name`, `Genre`, and `Rating` features. We created additional features for `User_Score` and `Critic_Score`, imputing missing values with the mean value within each genre. We dropped fields related to critic and user scores except for the newly created imputed features and retained only the required columns.
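
The genre-level imputation can also be written without a merge. The sketch below uses `groupby(...).transform('mean')` as an alternative to the merge-based cells above; it runs on a hypothetical toy frame and is meant only as an illustration of the same idea.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in Critic_Score.
toy = pd.DataFrame({
    "Genre": ["Action", "Action", "Action", "Sports", "Sports"],
    "Critic_Score": [80.0, 60.0, np.nan, 90.0, np.nan],
})

# transform('mean') returns one value per row: the mean of that row's genre.
genre_mean = toy.groupby("Genre")["Critic_Score"].transform("mean")

# Fill only the missing entries; observed scores are left untouched.
toy["Critic_Score_Imputed"] = toy["Critic_Score"].fillna(genre_mean)
print(toy)
```

Because `transform` keeps the original row order and index, no intermediate average columns need to be merged in and dropped afterwards.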

We analyzed the data distribution for `Critic_Score` and `User_Score`, observing their distribution patterns and correlation. We transformed all categorical features into binary dummy variables and then standardized every resulting feature, scores and dummy indicators alike, so that all features are on a similar scale.
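
As an alternative sketch, the encoding and scaling steps can be combined in a single scikit-learn `ColumnTransformer`, which was imported at the top of the notebook. Unlike the cells above, this version scales only the numeric scores and leaves the one-hot indicators as 0/1; the toy frame and column groupings are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame mirroring the columns kept in final_df.
toy = pd.DataFrame({
    "Platform": ["PS4", "PC", "PS4", "Wii"],
    "Genre": ["Action", "Sports", "Sports", "Action"],
    "Critic_Score": [80.0, 60.0, 70.0, 90.0],
    "User_Score": [8.0, 6.0, 7.0, 9.0],
})

numeric_cols = ["Critic_Score", "User_Score"]
categorical_cols = ["Platform", "Genre"]

preprocessor = ColumnTransformer(
    transformers=[
        # Standardize the numeric scores to zero mean and unit variance.
        ("num", StandardScaler(), numeric_cols),
        # One-hot encode the categorical columns.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0.0,  # always return a dense array
)
matrix = preprocessor.fit_transform(toy)
print(matrix.shape)
```

Whether to scale the dummy indicators is a design choice: scaling them, as the notebook does, gives rare categories more weight in distance and similarity calculations, while leaving them as 0/1 keeps every category on equal footing.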

The resulting preprocessed dataset has 9950 entries and 39 features, ready for building effective recommendation models. The analysis highlighted the scarcity of data for certain platforms and ratings, which will be considered during feature selection and model evaluation.