# Video Game Sales Analysis

This analysis explores a video game sales dataset to identify the best-selling games, compare sales by platform and genre, and track how many games were released over time. We load, clean, and analyze the data with Pandas and NumPy, focusing on sales across regions and the number of releases per year.


In [2]:
import pandas as pd
import numpy as np

# Load the dataset into a Pandas DataFrame
video_game_sales_df = pd.read_csv('video_game_sales.csv')


## Data Exploration

First, let's take a look at the dataset to understand its structure and contents. We'll also check for any missing values and get a sense of the basic statistics for the numerical columns.


In [6]:
# Display the first 5 rows of the dataset
print(video_game_sales_df.head(5))

# Check for missing values
print(video_game_sales_df.isnull().sum())

# Explore basic statistics of the numerical columns
print(video_game_sales_df.describe())

# Identify unique platforms, genres, and publishers
print(video_game_sales_df['Platform'].unique())
print(video_game_sales_df['Genre'].unique())
print(video_game_sales_df['Publisher'].unique())


   Rank                      Name Platform       Year         Genre Publisher  \
0     1                Wii Sports      Wii 2006-01-01        Sports  Nintendo   
1     2         Super Mario Bros.      NES 1985-01-01      Platform  Nintendo   
2     3            Mario Kart Wii      Wii 2008-01-01        Racing  Nintendo   
3     4         Wii Sports Resort      Wii 2009-01-01        Sports  Nintendo   
4     5  Pokemon Red/Pokemon Blue       GB 1996-01-01  Role-Playing  Nintendo   

   NA_Sales  EU_Sales  JP_Sales  Other_Sales  Global_Sales  
0     41.49     29.02      3.77         8.46         82.74  
1     29.08      3.58      6.81         0.77         40.24  
2     15.85     12.88      3.79         3.31         35.82  
3     15.75     11.01      3.28         2.96         33.00  
4     11.27      8.89     10.22         1.00         31.37  
Rank              0
Name              0
Platform          0
Year            271
Genre             0
Publisher        58
NA_Sales          0
EU_Sale

## Data Cleaning

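Now, let's clean our data. We convert the 'Year' column to datetime format. The missing values reported during exploration (271 in 'Year' and 58 in 'Publisher') and any outliers still need a handling decision, which depends on what closer inspection of the data shows.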


In [5]:
# Convert the 'Year' column to datetime format
video_game_sales_df['Year'] = pd.to_datetime(video_game_sales_df['Year'], format='%Y')

# Handle inconsistencies or outliers in the data
# Note: Specific handling depends on the data inspection. This might include removing or imputing values.
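
One possible way to handle the missing 'Year' and 'Publisher' values is sketched below. It is only an illustration under stated assumptions: `toy_df` is a made-up four-row stand-in for the real data, and both the choice to drop rows without a year and the 'Unknown' publisher label are assumptions, not decisions made in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for video_game_sales_df; the rows are illustrative only.
toy_df = pd.DataFrame({
    'Name': ['Game A', 'Game B', 'Game C', 'Game D'],
    'Year': [2006.0, np.nan, 2008.0, 1996.0],
    'Publisher': ['Nintendo', None, 'Nintendo', None],
    'Global_Sales': [82.74, 0.50, 35.82, 31.37],
})

# Assumption: a missing publisher is labelled rather than dropped.
toy_df['Publisher'] = toy_df['Publisher'].fillna('Unknown')

# Assumption: rows without a release year are removed before time-based analysis.
cleaned_df = toy_df.dropna(subset=['Year']).copy()

# Go through int and str so that a float such as 2006.0 is parsed as '2006'.
cleaned_df['Year'] = pd.to_datetime(
    cleaned_df['Year'].astype(int).astype(str), format='%Y'
)

print(cleaned_df)
```

Imputing a year instead of dropping the row is also possible, but it would distort any count of releases per year, so dropping is the more conservative choice for that kind of analysis.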


## Sales Analysis

We'll calculate total sales for each region and identify the top-selling games globally. Additionally, we'll find the platform with the highest average sales and determine the most popular genre based on global sales.


In [7]:
# Calculate the total sales for each region
region_sales = video_game_sales_df[['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales']].sum()
print(region_sales)

# Identify the top-selling games globally
top_selling_games = video_game_sales_df.sort_values(by='Global_Sales', ascending=False).head()

# Find the platform with the highest average global sales.
# Select 'Global_Sales' before aggregating: calling .mean() on the whole grouped
# frame also tries to average the text columns and fails with
# "TypeError: agg function failed [how->mean,dtype->object]".
highest_avg_sales_platform = video_game_sales_df.groupby('Platform')['Global_Sales'].mean().idxmax()

# Determine the most popular genre based on total global sales
most_popular_genre = video_game_sales_df.groupby('Genre')['Global_Sales'].sum().idxmax()


NA_Sales        4392.95
EU_Sales        2434.13
JP_Sales        1291.02
Other_Sales      797.75
Global_Sales    8920.44
dtype: float64
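
A note on aggregation order: calling `.mean()` on an entire grouped frame tries to average text columns such as 'Name', which pandas 2.x rejects with a `TypeError`. Selecting the column first avoids the problem. The sketch below shows the pattern on made-up data; `toy_df` and its four rows are illustrative assumptions, not rows from the dataset.

```python
import pandas as pd

# Toy frame with one text column that cannot be averaged; values are illustrative.
toy_df = pd.DataFrame({
    'Name': ['Game A', 'Game B', 'Game C', 'Game D'],
    'Platform': ['Wii', 'Wii', 'NES', 'GB'],
    'Genre': ['Sports', 'Racing', 'Platform', 'Sports'],
    'Global_Sales': [82.74, 35.82, 40.24, 31.37],
})

# Select the numeric column first, then aggregate it per group.
avg_sales_by_platform = toy_df.groupby('Platform')['Global_Sales'].mean()
best_platform = avg_sales_by_platform.idxmax()

total_sales_by_genre = toy_df.groupby('Genre')['Global_Sales'].sum()
top_genre = total_sales_by_genre.idxmax()

print(best_platform, top_genre)
```

Keeping the intermediate Series (`avg_sales_by_platform`, `total_sales_by_genre`) also makes it easy to inspect the full ranking rather than only the winner.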

## Time-based Analysis

Let's analyze the trend of video game releases over the years and identify the year with the highest number of game releases.


In [9]:
# Analyze the trend of video game releases over the years
releases_per_year = video_game_sales_df.groupby(video_game_sales_df['Year'].dt.year).size()

# Identify the year with the highest number of game releases
year_with_most_releases = releases_per_year.idxmax()
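
One detail worth checking in this grouping: rows whose 'Year' is missing (`NaT`) do not appear in the counts, because pandas drops missing group keys by default. The sketch below illustrates this on a made-up five-row frame; `toy_df` and its values are assumptions for illustration. Depending on the pandas version, `.dt.year` may return floats when `NaT` is present, so the result is cast to `int`.

```python
import pandas as pd

# Toy frame with one missing release year; values are illustrative only.
toy_df = pd.DataFrame({
    'Name': ['Game A', 'Game B', 'Game C', 'Game D', 'Game E'],
    'Year': [
        pd.Timestamp('2006-01-01'),
        pd.Timestamp('2008-01-01'),
        pd.NaT,
        pd.Timestamp('2008-01-01'),
        pd.Timestamp('1996-01-01'),
    ],
})

# Count releases per calendar year; the NaT row is silently left out.
releases_per_year = toy_df.groupby(toy_df['Year'].dt.year).size()

# Cast to int in case the year keys were promoted to float by the NaT.
busiest_year = int(releases_per_year.idxmax())

# Report how many rows were excluded so the gap is visible.
rows_without_year = int(toy_df['Year'].isna().sum())

print(releases_per_year)
print(busiest_year, rows_without_year)
```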


## NumPy Tasks

Next, we create a 'Total_Sales' column that sums the four regional sales columns, min-max normalize the 'Global_Sales' column to the range 0 to 1, and use NumPy's `corrcoef` to calculate the correlation matrix for the numerical columns.


In [10]:
# Create a new column 'Total_Sales' that represents the sum of sales across all regions
video_game_sales_df['Total_Sales'] = video_game_sales_df[['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']].sum(axis=1)

# Normalize the 'Global_Sales' column to a scale between 0 and 1 (min-max scaling)
global_sales = video_game_sales_df['Global_Sales']
video_game_sales_df['Global_Sales_Normalized'] = (global_sales - global_sales.min()) / (global_sales.max() - global_sales.min())

# Calculate the correlation matrix for numerical columns
correlation_matrix = np.corrcoef(video_game_sales_df.select_dtypes(include=[np.number]).T)
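
Since `np.corrcoef` returns a bare array, it can help to attach column labels and to check how the regional sum relates to 'Global_Sales'. The sketch below does both on the first four rows shown in the exploration output; the names `toy_df`, `region_cols`, and `labeled_corr`, and the rounding tolerance of 0.02, are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# First four rows of the sales columns, as shown in the head() output.
toy_df = pd.DataFrame({
    'NA_Sales': [41.49, 29.08, 15.85, 15.75],
    'EU_Sales': [29.02, 3.58, 12.88, 11.01],
    'JP_Sales': [3.77, 6.81, 3.79, 3.28],
    'Other_Sales': [8.46, 0.77, 3.31, 2.96],
    'Global_Sales': [82.74, 40.24, 35.82, 33.00],
})
region_cols = ['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']

# Sum the regional columns with NumPy and compare against the reported total.
total_sales = toy_df[region_cols].to_numpy().sum(axis=1)
global_sales = toy_df['Global_Sales'].to_numpy()
matches_global = np.isclose(total_sales, global_sales, atol=0.02)

# Min-max normalization: the smallest value maps to 0 and the largest to 1.
normalized = (global_sales - global_sales.min()) / (global_sales.max() - global_sales.min())

# rowvar=False treats each column as a variable, equivalent to transposing first.
corr = np.corrcoef(toy_df.to_numpy(), rowvar=False)
labeled_corr = pd.DataFrame(corr, index=toy_df.columns, columns=toy_df.columns)

print(matches_global)
print(labeled_corr.round(2))
```

On these rows the regional sum agrees with 'Global_Sales' up to rounding, so a 'Total_Sales' column built this way would be expected to correlate almost perfectly with 'Global_Sales' in the matrix.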
