# Video Game Sales Data Analysis

### **Project Goal**
The goal of this project is to explore and analyze a dataset of video game sales to uncover key insights. We will focus on identifying trends in sales over time, examining which genres and platforms are most popular, and investigating the performance of top publishers.

### **Dataset**
The dataset used is the "Video Game Sales" dataset from Kaggle, which contains 16,598 records of video game sales.

In [3]:
import pandas as pd
import numpy as np

In [4]:
# Load the dataset.
df = pd.read_csv('../data/vgsales.csv')

## 1. Data Exploration and Initial Inspection
With the dataset loaded, we perform an initial inspection to understand its structure, then handle any missing or inconsistent data.

In [5]:
# Display the first five rows in the DataFrame
df.head()

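|   | Rank | Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales |
|---|------|------|----------|------|-------|-----------|----------|----------|----------|-------------|--------------|
| 0 | 1 | Wii Sports | Wii | 2006.0 | Sports | Nintendo | 41.49 | 29.02 | 3.77 | 8.46 | 82.74 |
| 1 | 2 | Super Mario Bros. | NES | 1985.0 | Platform | Nintendo | 29.08 | 3.58 | 6.81 | 0.77 | 40.24 |
| 2 | 3 | Mario Kart Wii | Wii | 2008.0 | Racing | Nintendo | 15.85 | 12.88 | 3.79 | 3.31 | 35.82 |
| 3 | 4 | Wii Sports Resort | Wii | 2009.0 | Sports | Nintendo | 15.75 | 11.01 | 3.28 | 2.96 | 33.0 |
| 4 | 5 | Pokemon Red/Pokemon Blue | GB | 1996.0 | Role-Playing | Nintendo | 11.27 | 8.89 | 10.22 | 1.0 | 31.37 |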


In [6]:
# Get a summary of the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16598 entries, 0 to 16597
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Rank          16598 non-null  int64  
 1   Name          16598 non-null  object 
 2   Platform      16598 non-null  object 
 3   Year          16327 non-null  float64
 4   Genre         16598 non-null  object 
 5   Publisher     16540 non-null  object 
 6   NA_Sales      16598 non-null  float64
 7   EU_Sales      16598 non-null  float64
 8   JP_Sales      16598 non-null  float64
 9   Other_Sales   16598 non-null  float64
 10  Global_Sales  16598 non-null  float64
dtypes: float64(6), int64(1), object(4)
memory usage: 1.4+ MB


In [7]:
# Generate descriptive statistics for numerical columns
df.describe()

|       | Rank | Year | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales |
|-------|------|------|----------|----------|----------|-------------|--------------|
| count | 16598.0 | 16327.0 | 16598.0 | 16598.0 | 16598.0 | 16598.0 | 16598.0 |
| mean | 8300.605254 | 2006.406443 | 0.264667 | 0.146652 | 0.077782 | 0.048063 | 0.537441 |
| std | 4791.853933 | 5.828981 | 0.816683 | 0.505351 | 0.309291 | 0.188588 | 1.555028 |
| min | 1.0 | 1980.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.01 |
| 25% | 4151.25 | 2003.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.06 |
| 50% | 8300.5 | 2007.0 | 0.08 | 0.02 | 0.0 | 0.01 | 0.17 |
| 75% | 12449.75 | 2010.0 | 0.24 | 0.11 | 0.04 | 0.04 | 0.47 |
| max | 16600.0 | 2020.0 | 41.49 | 29.02 | 10.22 | 10.57 | 82.74 |


In [9]:
# Create a copy of the DataFrame to work with
df_cleaned = df.copy()

In [10]:
# Remove the rows with missing values in "Year" or "Publisher"
df_cleaned = df_cleaned.dropna(subset=['Year', 'Publisher'])

In [11]:
# Convert the "Year" column from a float to an integer
df_cleaned['Year'] = df_cleaned['Year'].astype('int64')

In [12]:
# Show the information of the cleaned DataFrame
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16291 entries, 0 to 16597
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Rank          16291 non-null  int64  
 1   Name          16291 non-null  object 
 2   Platform      16291 non-null  object 
 3   Year          16291 non-null  int64  
 4   Genre         16291 non-null  object 
 5   Publisher     16291 non-null  object 
 6   NA_Sales      16291 non-null  float64
 7   EU_Sales      16291 non-null  float64
 8   JP_Sales      16291 non-null  float64
 9   Other_Sales   16291 non-null  float64
 10  Global_Sales  16291 non-null  float64
dtypes: float64(5), int64(2), object(4)
memory usage: 1.5+ MB


## 2. Data Cleaning and Preprocessing

Based on our initial inspection, we identified two main issues:
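1.  The `Year` and `Publisher` columns had missing values: 271 of the 16,598 rows lacked a `Year` and 58 lacked a `Publisher`.
2.  The `Year` column was stored as `float64` rather than an integer type, because pandas falls back to a float type when a numeric column contains missing values.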
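
Both issues can be reproduced on a small frame. The sketch below uses a hypothetical three-row stand-in (`sample`, with made-up rows rather than the real file) to show how `isna().sum()` counts missing values per column and how a single missing year turns the column into `float64`.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row stand-in for the real data (not the actual file).
sample = pd.DataFrame({
    "Year": [2006, np.nan, 1985],
    "Publisher": ["Nintendo", "Nintendo", None],
})

# Count missing values per column.
missing_counts = sample.isna().sum()

# One NaN is enough to make an otherwise integer column float64.
year_dtype = sample["Year"].dtype

print(missing_counts)
print(year_dtype)
```

Running the same `isna().sum()` call on `df` should give the per-column counts that `df.info()` reports indirectly through its non-null column.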

To address these issues, we took the following steps:
-   Removed rows with a missing value in `Year` or `Publisher` (307 rows in total, leaving 16,291).
-   Converted the `Year` column to an integer data type to ensure it is in the correct format for future analysis.
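
Dropping rows is the approach taken in this notebook. As a point of comparison only, pandas also offers a nullable integer dtype, `Int64` (capital "I"), which stores whole-number years while keeping rows whose year is unknown. The sketch below uses a hypothetical stand-in series rather than the real column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Year column; values are illustrative only.
year = pd.Series([2006.0, np.nan, 1985.0], name="Year")

# "Int64" is pandas' nullable integer dtype: NaN becomes <NA>,
# and the remaining values are stored as integers.
year_nullable = year.astype("Int64")

print(year_nullable.dtype)
print(year_nullable.isna().sum())
```

The trade-off is that later steps, such as grouping by year, would then have to deal with the missing entries explicitly, which is one reason removing the rows up front is simpler here.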

Our cleaned DataFrame, `df_cleaned`, now has no missing values and the correct data types.
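
A quick way to back up this claim is to check it programmatically. The sketch below applies the same two steps to a hypothetical four-row stand-in (`sample`, made-up rows) and then verifies the result. It uses `astype` with a dict instead of column assignment, which is a stylistic variation rather than the notebook's exact code.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row stand-in for df; the values are illustrative only.
sample = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C", "Game D"],
    "Year": [2006.0, np.nan, 1985.0, 2010.0],
    "Publisher": ["Nintendo", "Sega", None, "Capcom"],
})

# Same two steps as above: drop incomplete rows, then cast Year to int64.
cleaned = sample.dropna(subset=["Year", "Publisher"]).astype({"Year": "int64"})

# Checks behind the two claims: no missing values, integer Year.
no_missing = not cleaned.isna().any().any()
year_is_int = pd.api.types.is_integer_dtype(cleaned["Year"])
rows_removed = len(sample) - len(cleaned)

print(no_missing, year_is_int, rows_removed)
```

The same two checks can be run directly against `df_cleaned`.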