# Video Game Sales Prediction - Machine Learning Lab

## Introduction and Setup
Welcome to this machine learning lab where we'll build a model to predict
whether a video game will be a "hit" based on its characteristics and sales data.
This notebook will guide you through the entire process, from data loading to
model evaluation and optimization.

Learning objectives:
1. Learn to preprocess and explore a real-world dataset
2. Build and evaluate a decision tree classifier
3. Optimize a model through hyperparameter tuning
4. Interpret model results and feature importance

In [None]:
# Install libraries if necessary (uncomment the line below)
# %pip install numpy pandas matplotlib seaborn scikit-learn requests

# Import the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning libraries
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, RocCurveDisplay


In [None]:
# Set random seed for reproducibility
np.random.seed(42)

In [None]:
# Configure visualizations
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("colorblind")

In [None]:
# Display settings for better output
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

## Download the Dataset

In [None]:
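# You can try running this cell to download the dataset directly, or upload it manually!
# Note: Kaggle normally requires a logged-in session, so this request may return
# a login page instead of the archive. The check below catches that case.
import io
import zipfile
import requests

url = 'https://www.kaggle.com/datasets/gregorut/videogamesales/download'
response = requests.get(url, timeout=60)

if response.ok and zipfile.is_zipfile(io.BytesIO(response.content)):
    with open('videogamesales.zip', 'wb') as f:
        f.write(response.content)
    # Extract the archive contents (the next cell expects vgsales.csv)
    with zipfile.ZipFile('videogamesales.zip') as archive:
        archive.extractall()
    print("Dataset downloaded and extracted successfully.")
else:
    print("Automatic download failed. Download the dataset from Kaggle and upload vgsales.csv manually.")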


## Load the Dataset

In [None]:
# Load the dataset
df = pd.read_csv('vgsales.csv')

# Let's take a look at the first few rows of the dataset
print("First 5 rows of the dataset:")
print(df.head())


## Dataset Information

In [None]:
# Get basic information about the dataset
print("\nDataset basic information:")
df.info()  # info() prints its report directly and returns None

# Get descriptive statistics
print("\nDescriptive statistics:")
print(df.describe())

In [None]:
# Cell 5: Check for missing values
print("\nMissing values per column:")
print(df.isnull().sum())

In [None]:
# Cell 6: Data Visualization - Global Sales Distribution
# ====================================================
# Visualize the distribution of global sales
plt.figure(figsize=(10, 6))
sns.histplot(df['Global_Sales'], bins=50, kde=True)
plt.title('Distribution of Global Sales')
plt.xlabel('Global Sales (millions of units)')
plt.ylabel('Frequency')
plt.axvline(x=1, color='red', linestyle='--', label='Hit Threshold (1M units)')
plt.legend()
plt.show()

In [None]:
# Cell 7: Create Target Variable
# ============================
# TASK: Create a binary target variable for "hit" games
# A game is considered a hit if it sold more than 1 million units (Global_Sales > 1)
# YOUR CODE HERE
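
One possible way to build the target, shown on a toy frame rather than the real dataset. The column name `Hit` is an assumption made for this sketch; the lab does not prescribe a name.

```python
import pandas as pd

# Toy stand-in for the lab's DataFrame; only Global_Sales matters here
toy = pd.DataFrame({"Global_Sales": [0.05, 0.80, 1.00, 1.20, 30.00]})

# Strictly more than 1 million units counts as a hit (1 = hit, 0 = not a hit)
toy["Hit"] = (toy["Global_Sales"] > 1).astype(int)
print(toy)
```

Note that a game with exactly 1.00 million units is not a hit under the strict `>` comparison.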


In [None]:
# Cell 8: Analyze Target Distribution
# =================================
# Let's see the proportion of hits in our dataset
# YOUR CODE HERE
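
A minimal sketch of checking class balance, using a hand-made series in place of the real target column (the name `Hit` is an assumption). `value_counts` gives absolute counts, and `normalize=True` turns them into proportions.

```python
import pandas as pd

# Toy target column: 2 hits out of 8 games
hits = pd.Series([0, 0, 0, 1, 0, 0, 1, 0], name="Hit")

counts = hits.value_counts()                 # absolute counts per class
shares = hits.value_counts(normalize=True)   # proportions per class

print(counts)
print(shares.round(3))
```

A strongly imbalanced target is a reason to look beyond accuracy when evaluating the model later.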


In [None]:
# Cell 9: Drop Non-Informative Columns
# ==================================
# TASK: Drop non-informative columns and store the result in df_clean
# (later cells expect a DataFrame named df_clean)
# Think about which columns won't help with prediction
# YOUR CODE HERE
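
A sketch of dropping columns on a toy frame. The column names (`Rank`, `Name`, `Platform`, `Global_Sales`, `Hit`) are assumptions for illustration, and which columns to drop is part of the exercise. One point worth weighing: if the target is derived from `Global_Sales`, keeping that column lets the tree recover the label directly.

```python
import pandas as pd

# Toy stand-in for the lab's DataFrame; the column names are assumptions
toy = pd.DataFrame({
    "Rank": [1, 2, 3],
    "Name": ["Game A", "Game B", "Game C"],
    "Platform": ["Wii", "NES", "DS"],
    "Global_Sales": [30.0, 0.8, 1.2],
    "Hit": [1, 0, 1],
})

# Identifiers carry no general signal; Global_Sales defines the target itself
columns_to_drop = ["Rank", "Name", "Global_Sales"]
toy_clean = toy.drop(columns=columns_to_drop)

print(toy_clean.columns.tolist())
```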


In [None]:
# Cell 10: Missing Value Analysis
# =============================
# Examine the 'Year' column, which may have missing values
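
A small sketch of quantifying missing values in one column, using a toy `Year` column. Counting and computing the share of missing rows helps decide between dropping and imputing.

```python
import numpy as np
import pandas as pd

# Toy Year column with two missing entries
toy = pd.DataFrame({"Year": [2006.0, np.nan, 2009.0, np.nan, 2013.0]})

n_missing = toy["Year"].isnull().sum()           # number of missing rows
pct_missing = toy["Year"].isnull().mean() * 100  # share of missing rows in percent

print(f"Missing Year values: {n_missing} ({pct_missing:.1f}%)")
print(toy["Year"].describe())
```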


In [None]:
# Cell 11: Handle Missing Values
# ============================
# TASK: Handle missing values
# Option 1: Drop rows with missing values
# YOUR CODE HERE

# Option 2: Fill missing values with median or mean
# YOUR CODE HERE
# df_clean['Year'] = df_clean['Year'].fillna(df_clean['Year'].median())
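
The two options can be compared side by side on a toy frame. This is an illustrative sketch, not the lab's reference solution; the frame and its values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame with missing years
toy = pd.DataFrame({"Year": [2006.0, np.nan, 2009.0, np.nan, 2013.0]})

# Option 1: drop rows where Year is missing (loses data, adds no invented values)
dropped = toy.dropna(subset=["Year"])

# Option 2: fill missing years with the median (keeps all rows)
filled = toy.copy()
filled["Year"] = filled["Year"].fillna(filled["Year"].median())

print(len(dropped), "rows after dropping")
print(filled["Year"].tolist())
```

Dropping is simplest when few rows are affected; imputing keeps the sample size but concentrates values at the median.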


In [None]:
# Cell 12: Categorical Variable Analysis
# ===================================
# Let's identify categorical columns
categorical_columns = df_clean.select_dtypes(include=['object']).columns.tolist()
print("\nCategorical columns:", categorical_columns)

In [None]:
# Cell 13: Encode Categorical Variables
# ==================================
# TASK: Encode categorical variables using LabelEncoder
# LabelEncoder maps each category to an integer code
# YOUR CODE HERE
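
A sketch of encoding several columns, assuming toy `Platform` and `Genre` columns. Keeping one fitted encoder per column makes it possible to map codes back to category names later. `LabelEncoder` assigns codes in sorted order of the category names, so the resulting numeric order is arbitrary; scikit-learn documents it as intended for target labels, with `OrdinalEncoder` as the feature-oriented alternative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy categorical columns; names and values are assumptions
toy = pd.DataFrame({
    "Platform": ["Wii", "NES", "Wii", "DS"],
    "Genre": ["Sports", "Platform", "Racing", "Sports"],
})

encoders = {}  # one fitted encoder per column
for column in ["Platform", "Genre"]:
    encoder = LabelEncoder()
    toy[column] = encoder.fit_transform(toy[column])
    encoders[column] = encoder

print(toy)
print(encoders["Platform"].classes_)  # category names, indexed by their code
```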


In [None]:
# Cell 14: Feature Engineering (Optional)
# =====================================
# BONUS TASK: Feature Engineering
# Creating new features might improve model performance
# Example: total regional sales (the sum of the regional sales columns, excluding Global_Sales)
# YOUR CODE HERE
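
A sketch of the regional-total idea. The regional column names (`NA_Sales`, `EU_Sales`, `JP_Sales`, `Other_Sales`) and the new name `Regional_Total` are assumptions for illustration. Keep in mind that such a sum closely tracks global sales, so if the target is defined from `Global_Sales` this feature may make the task nearly trivial.

```python
import pandas as pd

# Toy regional sales in millions of units; column names are assumptions
toy = pd.DataFrame({
    "NA_Sales": [0.50, 0.00],
    "EU_Sales": [0.25, 0.10],
    "JP_Sales": [0.25, 0.40],
    "Other_Sales": [0.00, 0.00],
})

regional_columns = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]

# Row-wise sum across the regional columns
toy["Regional_Total"] = toy[regional_columns].sum(axis=1)
print(toy)
```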


In [None]:
# Cell 15: Explore Processed Dataset
# ================================
# Let's look at the processed dataset
# YOUR CODE HERE


In [None]:
# Cell 16: Split Features and Target
# ================================
# TASK: Split the data into features (X) and target (y)
# YOUR CODE HERE
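
A minimal sketch of separating features from the target, assuming an already encoded toy frame whose target column is called `Hit` (a hypothetical name).

```python
import pandas as pd

# Toy processed frame: encoded features plus the binary target
toy = pd.DataFrame({
    "Platform": [2, 1, 0, 2],
    "Year": [2006, 1985, 2005, 2009],
    "Genre": [1, 0, 1, 1],
    "Hit": [1, 1, 0, 1],
})

X = toy.drop(columns=["Hit"])  # everything except the target
y = toy["Hit"]                 # the target only

print(X.shape, y.shape)
```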


In [None]:
# Cell 17: Train-Test Split
# =======================
# TASK: Split the data into training and testing sets (80/20 split)

# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Print the shapes to confirm the split
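
A sketch of the 80/20 split on synthetic data. The `stratify=y` argument is an addition beyond the lab's suggested call: it keeps the proportion of hits the same in both sets, which can matter when hits are rare.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic features and an imbalanced target (20 hits out of 100 games)
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "Platform": rng.integers(0, 5, size=100),
    "Year": rng.integers(1990, 2016, size=100),
})
y = pd.Series([1] * 20 + [0] * 80, name="Hit")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Train:", X_train.shape, "Test:", X_test.shape)
print("Hit rate (train/test):", y_train.mean(), y_test.mean())
```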


In [None]:
# Cell 18: Train Initial Model
# ==========================
# TASK: Train a Decision Tree classifier with default parameters
# YOUR CODE HERE
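
A sketch of fitting a default decision tree, using synthetic data from `make_classification` in place of the lab's dataset. With default parameters the tree grows until its leaves are pure, so it typically reaches perfect training accuracy; that is a sign of memorization rather than of quality.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the processed lab data
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Default hyperparameters; random_state only fixes tie-breaking between splits
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

print("Depth:", model.get_depth(), "Leaves:", model.get_n_leaves())
print("Training accuracy:", model.score(X_train, y_train))
```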


In [None]:
# Cell 19: Make Predictions
# =======================
# TASK: Make predictions on the test set
# YOUR CODE HERE
# Also store the predicted probability of being a hit (used later for ROC AUC)
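
A sketch of getting both class predictions and hit probabilities, again on synthetic data. `predict_proba` returns one column per class in the order of `model.classes_`, so column 1 is the probability of class 1.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the processed lab data
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

y_pred = model.predict(X_test)               # hard class labels (0 or 1)
y_proba = model.predict_proba(X_test)[:, 1]  # probability of class 1

print(y_pred[:10])
print(y_proba[:10])
```

For a fully grown tree the leaves are usually pure, so these probabilities tend to be exactly 0 or 1.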

In [None]:
# Cell 20: Calculate Evaluation Metrics
# ==================================
# TASK: Calculate evaluation metrics
# YOUR CODE HERE
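
A sketch of the metric calls on small hand-made arrays, so each number can be checked by hand. The values are invented for illustration and are not results from the lab's dataset. ROC AUC takes the probabilities, while the other metrics take the hard predictions.

```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
)

# Hand-made example: 3 true positives, 3 true negatives, 1 false positive, 1 false negative
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
y_proba = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / all
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_proba)         # ranking quality of the probabilities

print(f"Accuracy: {accuracy:.3f}  Precision: {precision:.3f}  Recall: {recall:.3f}")
print(f"F1: {f1:.3f}  ROC AUC: {auc:.4f}")
```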




In [None]:
# Cell 21: Confusion Matrix Visualization
# ====================================
# TASK: Visualize the confusion matrix
# YOUR CODE HERE
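
A sketch of plotting a confusion matrix from hand-made labels (invented values, not lab results). Rows are true classes and columns are predicted classes; the display labels "Not hit" and "Hit" are assumptions for this example.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Hand-made labels for illustration
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Layout for labels [0, 1]: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
print(cm)

display = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["Not hit", "Hit"])
display.plot(cmap="Blues")
plt.title("Confusion Matrix")
plt.show()
```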
