# Video Game Sales Analysis Project
## Project Overview

In this project, you'll analyze video game sales data to identify patterns that determine a game's success. Working as an analyst for the online store Ice, you'll use this information to help plan future advertising campaigns.

## Environment Setup and Required Libraries

In [None]:
# Import all required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Display settings
pd.set_option('display.max_columns', None)


## Step 1: Loading and Initial Data Exploration

First, let's load our dataset and examine its basic properties:

In [None]:
# Load the dataset
df = pd.read_csv('/datasets/games.csv')

# Display first few rows
df.head()

In [None]:
# Display basic information about the dataset
df.info()

In [None]:
# Check for duplicate entries
df.duplicated().sum()

### Key Questions to Answer:
- What's the total number of records in our dataset?
- What data types are present in each column?
- Are there any obvious issues with the data?
- Do we see any immediate patterns or anomalies?

# Total number of records

In [None]:
print(f"Total number of records: {df.shape[0]}")


# Data types present in each Column
According to `df.info()`, six columns are `float64` and five are `object`.

# Obvious issues with data

In [None]:
# Check missing values
df.isnull().sum()


Obvious issues include missing values in year_of_release, critic_score, user_score, and rating.

# Immediate patterns and anomalies

In [None]:
# Check unique values in some key columns
print("Unique platforms:", df['Platform'].unique())
print("Unique genres:", df['Genre'].unique())
print("Unique ratings:", df['Rating'].unique())


# Summary of patterns and anomalies
| Category | Pattern | Anomalies |
|---|---|---|
| Platform | Many active and legacy platforms | Obsolete platforms (e.g. 3DO, TG16) |
| Genre | Popular genres like Action, Sports, Shooter | nan, vague genre like Misc |
| Rating | Standard ESRB ratings present | K-A (obsolete), RP, nan, rare AO |

## Step 2: Data Preparation

### 2.1 Standardizing Column Names

In [None]:
# Convert column names to lowercase
df.columns = df.columns.str.lower()

In [None]:
# Verify the changes
df.columns

### 2.2 Data Type Conversion

In [None]:
# Check current data types
df.dtypes

In [None]:
# Make changes to data types if necessary

# Convert 'year_of_release' to integer (optional: can leave as float if NaN exists)

# Coerce any problematic values to NaN
df['year_of_release'] = pd.to_numeric(df['year_of_release'], errors='coerce').astype('Int64')


In [None]:
# Pay attention to the abbreviation TBD (to be determined). Specify how you intend to handle such cases.

# Convert 'user_score' to numeric; errors='coerce' turns 'TBD' and any other
# non-numeric value into NaN, so no intermediate string cast is needed
df['user_score'] = pd.to_numeric(df['user_score'], errors='coerce')

# Check current data types
df.dtypes
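
As a small illustration of the two conversions above, the sketch below runs the same calls on made-up values. The sample Series, including the lowercase `'tbd'` spelling, are assumptions for demonstration and are not rows from `games.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical scores: numeric strings, a 'tbd' placeholder, and a real NaN
sample_scores = pd.Series(['8.0', 'tbd', np.nan, '6.5'])

# errors='coerce' replaces anything that cannot be parsed with NaN
numeric_scores = pd.to_numeric(sample_scores, errors='coerce')

# Hypothetical years stored as floats because one value is missing
sample_years = pd.Series([2006.0, np.nan, 2015.0]).astype('Int64')

print(numeric_scores)
print(sample_years)
```

The nullable `Int64` dtype keeps the missing year as `<NA>` instead of forcing the whole column back to float.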

### 2.3 Handling Missing Values

In [None]:
# Examine missing values

# Count missing values in each column
missing_values = df.isnull().sum()

# Display only columns with missing values
missing_values[missing_values > 0]

In [None]:
# Calculate percentage of missing values

# Percentage of missing values
missing_percent = (df.isnull().sum() / len(df)) * 100

# Display only columns with missing data
missing_percent[missing_percent > 0].sort_values(ascending=False)


In [None]:
# Calculate missing values and percentage as a DataFrame
missing_values = df.isnull().sum().to_frame(name='Missing Values')
missing_values['% Missing'] = round((df.isnull().sum() / len(df)) * 100, 2)

# Sort by percentage of missing values (descending)
missing_values = missing_values[missing_values['Missing Values'] > 0].sort_values(by='% Missing', ascending=False)

# Display the result
missing_values


<div style="background-color:lightblue; color:darkblue">
I have updated the missing values with to_frame</div>

# Analyze patterns in missing values
Over half of the games lack user scores and critic scores. Ratings are missing for about 40% of games, which suggests that many titles never received an ESRB rating. Year of release is missing for just over 1% of rows, and name and genre are missing for a much smaller share.
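
One way to probe whether these gaps are random is to check how the share of missing values changes by release year. The sketch below is only an illustration: `missing_share_by` is a hypothetical helper, and the small DataFrame is made-up data rather than the real dataset.

```python
import numpy as np
import pandas as pd

def missing_share_by(frame, group_col, value_cols):
    """Return the fraction of missing values in value_cols within each group."""
    return frame[value_cols].isna().groupby(frame[group_col]).mean()

# Made-up rows: older games lack scores and ratings more often
toy = pd.DataFrame({
    'year_of_release': [1995, 1995, 2010, 2010, 2010, 2010],
    'critic_score': [np.nan, np.nan, 80.0, np.nan, 75.0, 90.0],
    'rating': [None, None, 'E', 'T', None, 'M'],
})

share = missing_share_by(toy, 'year_of_release', ['critic_score', 'rating'])
print(share)
```

If the share of missing values is much higher in early years, the gaps are probably tied to when rating and review coverage began rather than to random data loss.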



In [None]:
# Handle missing values based on the analysis above

# 1. Drop rows where 'year_of_release' or 'genre' is missing
df = df.dropna(subset=['year_of_release', 'genre'])

# 2. Fill missing values in 'rating' with 'Unknown'
df['rating'] = df['rating'].fillna('Unknown')

# 3. Leave 'user_score' and 'critic_score' as-is: 'user_score' was already
#    converted to numeric in section 2.2, and NaN values are ignored in
#    correlations and visualizations

# 4. Drop rows where 'name' is missing
df = df.dropna(subset=['name'])

# Final check on missing values
df.isnull().sum()

### 2.4 Calculate Total Sales

In [None]:
# Calculate total sales across all regions and put them in a different column

# Calculate total sales by summing across all regions
df['total_sales'] = df[['na_sales', 'eu_sales', 'jp_sales', 'other_sales']].sum(axis=1)

# Preview the updated DataFrame
df[['name', 'na_sales', 'eu_sales', 'jp_sales', 'other_sales', 'total_sales']].head()

# Step 3: Analyzing Video Game Sales Data

## 3.1 Temporal Analysis of Game Releases
Let's first examine the distribution of game releases across different years to understand our data's coverage and significance:

In [None]:
# Create a DataFrame with game releases by year
games_per_year = df['year_of_release'].value_counts().sort_index()

In [None]:
# Visualize the distribution of games across years
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
games_per_year.plot(kind='bar')
plt.title('Number of Games Released per Year')
plt.xlabel('Year of Release')
plt.ylabel('Number of Games')
plt.xticks(rotation=45)
plt.grid(axis='y')
plt.tight_layout()
plt.show()


In [None]:
# Display summary statistics for each year
games_per_year.describe()

### Questions to Consider:
- Which years show significant numbers of game releases?

Releases peak from the mid-2000s to the early 2010s. Older years (the 1980s and early 1990s) have far fewer releases. The count also drops toward 2016, possibly because the data for that year is incomplete.

- Are there any notable trends or patterns in the number of releases?

The number of games rises from 1995 to roughly 2008–2012. The decline over 2013–2016 could reflect market saturation, digital distribution (not captured in this dataset), or missing data.

- Is there enough recent data to make predictions for 2017?

Yes. If we focus on 2012 to 2016, we have:
1. A stable and consistent set of game releases
2. A modern console cycle (PS4, Xbox One, etc.)
3. Recent trends and platform performance

## 3.2 Platform Sales Analysis Over Time

Now let's analyze how sales vary across platforms and years:

In [None]:
# Calculate total sales by platform and year
platform_year_sales = df.groupby(['year_of_release', 'platform'])['total_sales'].sum().unstack().fillna(0)

# Preview the table
platform_year_sales.tail()

In [None]:
# Create a heatmap of platform sales over time
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(16, 10))
sns.heatmap(platform_year_sales.T, cmap='YlGnBu', linewidths=0.5)

plt.title('Platform Sales by Year (in millions)', fontsize=16)
plt.xlabel('Year of Release')
plt.ylabel('Platform')
plt.tight_layout()
plt.show()


In [None]:
# Identify platforms with declining sales

# Total sales by platform and year
platform_trends = df.groupby(['year_of_release', 'platform'])['total_sales'].sum().reset_index()

# Pivot the table for easier plotting
pivot_table = platform_trends.pivot(index='year_of_release', columns='platform', values='total_sales')

# Focus on platforms with the highest overall sales
top_platforms = df.groupby('platform')['total_sales'].sum().sort_values(ascending=False).head(10).index

# Plot trends for those platforms
plt.figure(figsize=(14, 8))

for platform in top_platforms:
    plt.plot(pivot_table.index, pivot_table[platform], label=platform)

plt.title('Total Sales Over Time by Platform')
plt.xlabel('Year of Release')
plt.ylabel('Total Sales (millions)')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()


### Questions to Consider:
- Which platforms show consistent sales over time?

The line plot and heatmap show that PS2, DS, PS3, and X360 had strong, steady sales across multiple years. In more recent years, PS4 and XOne began to show consistent growth, especially from 2013 onward. These platforms had long lifespans and dominated their console generations.

- Can you identify platforms that have disappeared from the market?

Yes, several platforms rose in popularity and then disappeared completely from recent years. Examples include:

Wii: Sales peaked around 2008–2010, then declined sharply and disappeared by 2014.

PS2, DS, PSP, X360: Strong platforms of the 2000s that had mostly disappeared by 2013–2015.

Older platforms like GameCube, N64, GBA, and SNES also show no activity after the mid-2000s.

This decline usually follows the release of next-generation consoles.

- What's the typical lifecycle of a gaming platform?

Most platforms have a 6–8 year lifecycle, with some outliers lasting longer due to popularity or backward compatibility (e.g., PS2).
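
The lifecycle estimate can also be checked numerically. The sketch below measures each platform's lifespan as the number of years between its first and last release, inclusive. `platform_lifespans` is a hypothetical helper and the rows are made up for illustration.

```python
import pandas as pd

def platform_lifespans(frame):
    """Years between a platform's first and last release year, inclusive."""
    span = frame.groupby('platform')['year_of_release'].agg(['min', 'max'])
    span['lifespan_years'] = span['max'] - span['min'] + 1
    return span.sort_values('lifespan_years', ascending=False)

# Made-up release years for two platforms
toy = pd.DataFrame({
    'platform': ['PS2', 'PS2', 'PS2', 'Wii', 'Wii'],
    'year_of_release': [2000, 2005, 2011, 2006, 2013],
})

lifespans = platform_lifespans(toy)
print(lifespans)
print("Median lifespan:", lifespans['lifespan_years'].median())
```

On the real data, platforms still selling in the final year would have truncated lifespans, and PC would skew the result because it never retires, so both are worth excluding before taking the median.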

## 3.3 Determining Relevant Time Period

Based on your analysis above, determine the appropriate time period for predicting 2017 sales:

In [None]:

# Filter to relevant years (2012–2016)
relevant_years = list(range(2012, 2017))
df_relevant = df[df['year_of_release'].isin(relevant_years)]

# Confirm the filter
df_relevant['year_of_release'].value_counts().sort_index()


# Justify your choice with data

# Game count per year
games_per_year = df.groupby('year_of_release')['name'].count()

# Total global sales per year
sales_per_year = df.groupby('year_of_release')['total_sales'].sum()

# Combine into a single DataFrame
yearly_summary = pd.DataFrame({
    'Number of Games': games_per_year,
    'Total Global Sales (millions)': sales_per_year
})

# Filter for the last 10 years (2007–2016) to show the trend
yearly_summary_recent = yearly_summary.loc[2007:2016]

# Display the summary
display(yearly_summary_recent)



In [None]:
#updated time period selection

# Filter to relevant years (2014–2016) based on recent market trends
relevant_years = list(range(2014, 2017))
df_relevant = df[df['year_of_release'].isin(relevant_years)]

# Confirm the filter
df_relevant['year_of_release'].value_counts().sort_index()

# Justification

# Game count per year
games_per_year = df.groupby('year_of_release')['name'].count()

# Total global sales per year
sales_per_year = df.groupby('year_of_release')['total_sales'].sum()

# Combine into a single DataFrame
yearly_summary = pd.DataFrame({
    'Number of Games': games_per_year,
    'Total Global Sales (millions)': sales_per_year
})

# Focus on recent years only
yearly_summary_recent = yearly_summary.loc[2013:2016]

# Display the summary
display(yearly_summary_recent)


<div style="background-color:lightblue; color:darkblue">
I have updated the time period selection to 2014–2016 for forecasting 2017 sales. The summary table above also includes 2013 for context.</div>

### Document Your Decision:
- What years did you select and why?

I selected the years 2014 to 2016 as the relevant time period for analyzing and predicting 2017 video game sales, matching the filter applied to `df_relevant` above.

This 3-year window balances having enough historical data against keeping market conditions relevant to current trends.

- How does this period reflect current market conditions?

These years cover the latest active console generation at the time (PlayStation 4, Xbox One, and Nintendo 3DS) alongside PC.

The data captures sales trends for modern platforms that are still active in 2017.

Obsolete platforms such as PS2, Wii, and DS have largely been phased out by this point, so the data is focused on current systems.

- What factors influenced your decision?

Volume of game releases: The number of games released per year remains relatively stable across this window.

Total global sales: Sales during this period are high and consistent, indicating active consumer demand.

Platform relevance: Platforms dominating this window are still on the market in 2017.

Lifecycle alignment: Platforms like PS4 and XOne launched in 2013, and this time frame captures their growth and maturity phases.

Conclusion: This time period provides a realistic and relevant foundation for forecasting future game performance.



## 3.4 Platform Performance Analysis

Using your selected time period, let's analyze platform performance:

In [None]:
# Analyze platform sales trends

# Total sales by platform for the selected period (2014–2016)
platform_sales = df_relevant.groupby('platform')['total_sales'].sum().sort_values(ascending=False)


In [None]:
# Sort platforms by total sales
# Display sorted sales
platform_sales


In [None]:
# Visualize top platforms

# Plot total sales of top platforms
plt.figure(figsize=(10, 6))
platform_sales.plot(kind='bar')
plt.title('Total Sales by Platform (2014–2016)')
plt.ylabel('Total Sales (millions)')
plt.xlabel('Platform')
plt.grid(axis='y')
plt.tight_layout()
plt.show()


# Calculate year-over-year growth for each platform

# Group by year and platform, then sum total sales
platform_yearly_sales = df_relevant.groupby(['year_of_release', 'platform'])['total_sales'].sum().reset_index()

# Pivot table: years as rows, platforms as columns
platform_sales_pivot = platform_yearly_sales.pivot(index='year_of_release', columns='platform', values='total_sales').fillna(0)

# Calculate YoY growth rates for each platform
platform_growth = platform_sales_pivot.pct_change().fillna(0) * 100  # Convert to percentage

# Display growth table
platform_growth.round(2).tail()
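
One caveat with the growth table: because missing platform-years are filled with 0 before `pct_change()`, a platform that goes from 0 to any positive value gets an infinite growth rate. The sketch below shows the effect on made-up numbers and one possible workaround, which is to treat zero sales as missing. This is an assumption on my part, not the only reasonable choice.

```python
import numpy as np
import pandas as pd

# Made-up pivot: years as rows, platforms as columns, sales in millions
sales_pivot = pd.DataFrame(
    {'PS4': [100.0, 120.0, 70.0], 'PSP': [0.0, 0.5, 0.0]},
    index=pd.Index([2014, 2015, 2016], name='year_of_release'),
)

# Naive version: the 0 -> 0.5 step for PSP produces inf
naive_growth = sales_pivot.pct_change(fill_method=None) * 100

# Treat zero sales as "no data" so growth is only computed between real values
safe_growth = sales_pivot.replace(0, np.nan).pct_change(fill_method=None) * 100

print(naive_growth.round(2))
print(safe_growth.round(2))
```

With zeros masked, years without a valid previous value show NaN instead of a misleading `inf` or 0% figure.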




## 3.5 Sales Distribution Analysis

Let's examine the distribution of sales across platforms:

In [None]:
# Create box plot of sales by platform

# Set figure size
plt.figure(figsize=(14, 7))

# Create a box plot of total sales per platform
sns.boxplot(data=df_relevant, x='platform', y='total_sales', showfliers=False)

# Enhance readability
plt.yscale('log')  # Log scale to handle outliers
plt.title('Distribution of Global Sales by Platform (2014–2016)')
plt.xlabel('Platform')
plt.ylabel('Global Sales (millions, log scale)')
plt.xticks(rotation=45)
plt.grid(True)
plt.tight_layout()
plt.show()


<div style="background-color:lightblue; color:darkblue">
Great suggestion. I have hidden the outliers by adding showfliers=False to the original code.</div>

In [None]:
# Calculate detailed statistics for each platform

# Group by platform and describe sales stats
platform_stats = df_relevant.groupby('platform')['total_sales'].describe().round(2)

# Display results
platform_stats


In [None]:
# Choose a popular platform based on your previous analysis

# Filter the relevant dataset for PS4 only
ps4_data = df_relevant[df_relevant['platform'] == 'PS4']

# Preview data
ps4_data[['name', 'critic_score', 'user_score', 'total_sales']].head()


In [None]:
# Create scatter plots for both critic and user scores


In [None]:
# Critic Scores
plt.figure(figsize=(8, 6))
sns.scatterplot(data=ps4_data, x='critic_score', y='total_sales')
plt.title('Critic Score vs. Total Sales (PS4)')
plt.xlabel('Critic Score')
plt.ylabel('Total Sales (millions)')
plt.grid(True)
plt.tight_layout()
plt.show()

# User Scores
plt.figure(figsize=(8, 6))
sns.scatterplot(data=ps4_data, x='user_score', y='total_sales')
plt.title('User Score vs. Total Sales (PS4)')
plt.xlabel('User Score')
plt.ylabel('Total Sales (millions)')
plt.grid(True)
plt.tight_layout()
plt.show()


# Calculate correlations

# Correlation between critic_score and total_sales
critic_corr = ps4_data[['critic_score', 'total_sales']].corr().iloc[0, 1]

# Correlation between user_score and total_sales
user_corr = ps4_data[['user_score', 'total_sales']].corr().iloc[0, 1]

# Display results
print(f"Correlation between Critic Score and Sales (PS4): {critic_corr:.2f}")
print(f"Correlation between User Score and Sales (PS4): {user_corr:.2f}")
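
To see whether the PS4 pattern holds on other platforms, the same correlations can be computed per platform. The sketch below uses a hypothetical helper, `score_sales_corr`, on made-up rows. The column names follow the notebook, but the values are illustrative only.

```python
import pandas as pd

def score_sales_corr(frame, platforms):
    """Pearson correlation of each score column with total_sales, per platform."""
    rows = {}
    for platform in platforms:
        subset = frame[frame['platform'] == platform]
        rows[platform] = {
            'critic_corr': subset['critic_score'].corr(subset['total_sales']),
            'user_corr': subset['user_score'].corr(subset['total_sales']),
        }
    return pd.DataFrame.from_dict(rows, orient='index')

# Made-up data: critic scores track sales perfectly, user scores do not
toy = pd.DataFrame({
    'platform': ['PS4'] * 4 + ['XOne'] * 4,
    'critic_score': [60, 70, 80, 90, 55, 65, 75, 85],
    'user_score': [7.0, 6.0, 8.0, 5.0, 6.5, 7.5, 5.5, 8.5],
    'total_sales': [0.5, 1.0, 1.5, 2.0, 0.2, 0.4, 0.6, 0.8],
})

corr_table = score_sales_corr(toy, ['PS4', 'XOne'])
print(corr_table.round(2))
```

`Series.corr` skips pairs where either value is NaN, so games without a score do not need to be filtered out first.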


## 3.7 Cross-Platform Comparison

Compare sales performance of games across different platforms:

In [None]:
# Find games released on multiple platforms
#I will group by game name and count how many platforms each game appears on:

# Count how many platforms each game appears on
multi_platform_games = df_relevant.groupby('name')['platform'].nunique()

# Filter for games released on 2 or more platforms
multi_platform_games = multi_platform_games[multi_platform_games > 1]

# Get only the records for these multi-platform games
df_multi_platform = df_relevant[df_relevant['name'].isin(multi_platform_games.index)]


In [None]:
# Compare sales across platforms for these games
# Your code here to analyze and visualize cross-platform performance

# I'll visualize how the same game performs differently by platform and use a box plot to show general trends:

plt.figure(figsize=(14, 7))
sns.boxplot(data=df_multi_platform, x='platform', y='total_sales')

plt.title('Sales Distribution of Multi-Platform Games (2014–2016)')
plt.xlabel('Platform')
plt.ylabel('Total Sales (millions)')
plt.xticks(rotation=45)
plt.grid(True)
plt.tight_layout()
plt.show()


# Compare Average sale per platform for shared games

# Calculate average sales for each game on each platform
avg_sales = df_multi_platform.groupby(['name', 'platform'])['total_sales'].mean().reset_index()

# Pivot to compare side by side
sales_comparison = avg_sales.pivot(index='name', columns='platform', values='total_sales')

# Show first few rows
sales_comparison.head()
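
The pivot makes a like-for-like comparison possible: keep only the titles present on both platforms, then compare their sales pairwise. The sketch below does this for PS4 and XOne on made-up rows. The platform pair and the ratio column name are my own choices for illustration.

```python
import pandas as pd

# Made-up data: Game C exists on PS4 only
toy = pd.DataFrame({
    'name': ['Game A', 'Game A', 'Game B', 'Game B', 'Game C'],
    'platform': ['PS4', 'XOne', 'PS4', 'XOne', 'PS4'],
    'total_sales': [2.0, 1.0, 0.9, 0.6, 3.0],
})

comparison = toy.pivot_table(index='name', columns='platform',
                             values='total_sales', aggfunc='sum')

# Keep titles released on both platforms and compute the sales ratio
paired = (
    comparison[['PS4', 'XOne']]
    .dropna()
    .assign(ps4_to_xone_ratio=lambda d: d['PS4'] / d['XOne'])
)
print(paired)
```

Comparing the same titles removes the effect of platform exclusives, which would otherwise distort a plain comparison of platform averages.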


## 3.8 Genre Analysis

Finally, let's examine the distribution of games by genre:

In [None]:
# Analyze genre performance

# Total sales by genre
genre_sales = df_relevant.groupby('genre')['total_sales'].sum().sort_values(ascending=False)

# Display sales
genre_sales


In [None]:
# Sort genres by total sales

# Group by genre and sum total sales, then sort in descending order
genre_sales = df_relevant.groupby('genre')['total_sales'].sum().sort_values(ascending=False)

# Display result
genre_sales



In [None]:
# Visualize genre distribution

# Bar plot of total sales by genre
plt.figure(figsize=(12, 6))
genre_sales.plot(kind='bar', color='skyblue')

plt.title('Total Global Sales by Genre (2014–2016)')
plt.xlabel('Genre')
plt.ylabel('Total Sales (millions)')
plt.grid(axis='y')
plt.tight_layout()
plt.show()


In [None]:
# Calculate market share for each genre

# Calculate percentage share of each genre
genre_market_share = (genre_sales / genre_sales.sum()) * 100


# Pie chart of market share
plt.figure(figsize=(9, 9))
plt.pie(genre_market_share, labels=genre_market_share.index, autopct='%1.1f%%', startangle=140)
plt.title('Genre Market Share (2014–2016)')
plt.tight_layout()
plt.show()



### Key Questions for Genre Analysis:
- Which genres consistently perform well?
- Are there any genres showing recent growth or decline?
- How does the average performance vary across genres?

# Which genres consistently perform well?

Based on total global sales from 2014 to 2016, the top-performing genres are:

Action — by far the most dominant, likely due to a large number of releases and popularity on major platforms.

Shooter — consistently strong, driven by franchises like Call of Duty, Battlefield, etc.

Sports — performs well every year, with reliable franchises like FIFA and NBA 2K.

These genres have a large player base, broad appeal, and frequent releases across all major platforms.

In [None]:
# Are there any genres showing recent growth or decline?
# To determine this, I can group by both genre and year, then sum total sales:

# Total sales by genre and year
genre_trends = df_relevant.groupby(['year_of_release', 'genre'])['total_sales'].sum().unstack().fillna(0)

# Plot example for top genres
genre_trends[['Action', 'Shooter', 'Sports']].plot(figsize=(12, 6), title='Top Genre Trends (2014–2016)')


# Conclusion for recent growth or decline
From such a plot, you may observe:

Action and Shooter genres have relatively stable performance.

Role-Playing may show growth due to titles like The Witcher 3 and Pokemon.

Some genres like Racing and Puzzle may be declining in relevance.

In [None]:
# Average performance across genres
avg_sales_per_genre = df_relevant.groupby('genre')['total_sales'].mean().sort_values(ascending=False)
avg_sales_per_genre


# Conclusion for Average performance

Shooter and Role-Playing games often have higher average sales per title — fewer games, but they sell more.

Puzzle, Strategy, and Simulation genres have lower average sales, suggesting niche markets.
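
Average sales can be pulled up by a handful of blockbusters, so it helps to look at the median and the number of titles alongside the mean. The sketch below does this on made-up numbers and is meant only to show the idea.

```python
import pandas as pd

# Made-up data: one Shooter blockbuster inflates that genre's mean
toy = pd.DataFrame({
    'genre': ['Shooter', 'Shooter', 'Shooter', 'Puzzle', 'Puzzle', 'Puzzle'],
    'total_sales': [0.2, 0.3, 14.5, 0.1, 0.2, 0.3],
})

genre_summary = (
    toy.groupby('genre')['total_sales']
    .agg(['count', 'mean', 'median'])
    .sort_values('median', ascending=False)
)
print(genre_summary.round(2))
```

A large gap between mean and median signals that a genre's average depends on a few hits rather than on typical titles.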

# Summary

Top sellers: Action, Shooter, Sports

High average earners: Shooter, Role-Playing

Potential growth areas: Role-Playing, Adventure

Declining or niche: Puzzle, Strategy, Simulation

This analysis can help shape 2017 marketing and release strategies for targeted genres.

# Step 4: Regional Market Analysis and User Profiles

In this section, we will analyze the gaming market characteristics across three major regions: North America (NA), Europe (EU), and Japan (JP). Our analysis will focus on platform preferences, genre popularity, and the impact of ESRB ratings in each region.

## 4.1 Regional Platform Analysis

Let's begin by examining platform performance across different regions:

In [None]:
# Function to analyze platform performance by region

def top_platforms_by_region(df, region_col, top_n=5):
    # Group by platform and sum sales for the region
    region_platform_sales = df.groupby('platform')[region_col].sum().sort_values(ascending=False).head(top_n)
    
    # Display the results
    print(f"Top {top_n} platforms in {region_col.upper()}:")
    display(region_platform_sales)
    
    # Plot the results
    plt.figure(figsize=(8, 5))
    region_platform_sales.plot(kind='bar', color='teal')
    plt.title(f'Top {top_n} Platforms in {region_col.upper()}')
    plt.xlabel('Platform')
    plt.ylabel(f'Sales in {region_col.upper()} (millions)')
    plt.grid(axis='y')
    plt.tight_layout()
    plt.show()



In [None]:
# Analyze each region

#North America
top_platforms_by_region(df_relevant, 'na_sales')

#Europe
top_platforms_by_region(df_relevant, 'eu_sales')

#Japan
top_platforms_by_region(df_relevant, 'jp_sales')



### Cross-Regional Platform Comparison

Let's create a comparative analysis of platform performance across regions:

In [None]:
# Create a comparative platform analysis

# Calculate total sales per platform by region

# Sum sales by platform and region
regional_platform_sales = df_relevant.groupby('platform')[['na_sales', 'eu_sales', 'jp_sales']].sum()

# Sort platforms by total global sales to get top platforms
top_platforms = df_relevant.groupby('platform')['total_sales'].sum().sort_values(ascending=False).head(5).index

# Filter only top platforms
regional_top_platforms = regional_platform_sales.loc[top_platforms]

# Display table
regional_top_platforms


In [None]:
# Visualize cross-regional comparison for top platforms

# Plot grouped bar chart
regional_top_platforms.plot(kind='bar', figsize=(10, 6))
plt.title('Top 5 Platforms by Region (2014–2016)')
plt.xlabel('Platform')
plt.ylabel('Sales (millions)')
plt.xticks(rotation=0)
plt.grid(axis='y')
plt.legend(title='Region')
plt.tight_layout()
plt.show()
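
Because the regional markets differ in size, absolute sales make Japan look small on every platform. Converting each region's column to percentage shares puts the regions on the same scale. The sketch below uses made-up totals to show the calculation.

```python
import pandas as pd

# Made-up regional totals per platform, in millions
regional_sales = pd.DataFrame(
    {
        'na_sales': [100.0, 80.0, 20.0],
        'eu_sales': [130.0, 50.0, 20.0],
        'jp_sales': [15.0, 0.5, 44.5],
    },
    index=pd.Index(['PS4', 'XOne', '3DS'], name='platform'),
)

# Divide each column by its total so every region sums to 100%
regional_share = regional_sales.div(regional_sales.sum(axis=0), axis=1) * 100
print(regional_share.round(1))
```

Shares like these make it easier to see which platform leads within each region, independent of overall market size.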


## 4.2 Regional Genre Analysis

Now let's examine genre preferences across regions:

In [None]:
# Function to analyze genre performance by region

def top_genres_by_region(df, region_col, top_n=5):
    # Group by genre and sum sales in the specified region
    genre_sales = (
        df.groupby('genre')[region_col]
        .sum()
        .sort_values(ascending=False)
        .head(top_n)
    )

    # Plotting
    plt.figure(figsize=(8, 5))
    genre_sales.plot(kind='bar', color='coral')
    plt.title(f'Top {top_n} Genres in {region_col.upper()}')
    plt.xlabel('Genre')
    plt.ylabel(f'Sales in {region_col.upper()} (millions)')
    plt.grid(axis='y')
    plt.tight_layout()
    plt.show()

    # Print sales for reference
    print(f"\nTop {top_n} genres in {region_col.upper()}:")
    print(genre_sales)

    
# North America
top_genres_by_region(df_relevant, 'na_sales')

#Europe
top_genres_by_region(df_relevant, 'eu_sales')

#Japan
top_genres_by_region(df_relevant, 'jp_sales')


### Cross-Regional Genre Comparison

Let's compare genre preferences across regions:

In [None]:
# Create a comparative genre analysis

# Group by genre and sum sales in each region
regional_genre_sales = df_relevant.groupby('genre')[['na_sales', 'eu_sales', 'jp_sales']].sum()

# Sort by total sales in NA for consistent comparison
regional_genre_sales = regional_genre_sales.sort_values(by='na_sales', ascending=False)

# Display the table
regional_genre_sales


In [None]:
# visualize genre preference

# Plot grouped bar chart
regional_genre_sales.plot(kind='bar', figsize=(12, 6))
plt.title('Genre Sales Comparison by Region (2014–2016)')
plt.xlabel('Genre')
plt.ylabel('Sales (millions)')
plt.xticks(rotation=45)
plt.legend(title='Region')
plt.grid(axis='y')
plt.tight_layout()
plt.show()


## 4.3 ESRB Rating Impact Analysis

Finally, let's examine how ESRB ratings affect sales in each region:

In [None]:
# Function to analyze ESRB rating impact

def esrb_impact_by_region(df, region_col):
    # Group by ESRB rating and sum regional sales
    esrb_sales = (
        df.groupby('rating')[region_col]
        .sum()
        .sort_values(ascending=False)
    )

    # Plot
    plt.figure(figsize=(8, 5))
    esrb_sales.plot(kind='bar', color='slateblue')
    plt.title(f'ESRB Rating Impact on Sales in {region_col.upper()}')
    plt.xlabel('ESRB Rating')
    plt.ylabel(f'Sales in {region_col.upper()} (millions)')
    plt.grid(axis='y')
    plt.tight_layout()
    plt.show()

    # Print results
    print(f"\nESRB impact in {region_col.upper()}:")
    print(esrb_sales)


In [None]:
# Analyze ESRB impact for each region

#North America
esrb_impact_by_region(df_relevant, 'na_sales')

#Europe
esrb_impact_by_region(df_relevant, 'eu_sales')

#Japan
esrb_impact_by_region(df_relevant, 'jp_sales')
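
When reading these charts, it is worth checking how much of each region's sales falls under the `'Unknown'` rating introduced in section 2.3. ESRB is a North American rating system, so one plausible outcome is that unrated titles make up a larger share in Japan. The sketch below shows the calculation on made-up numbers and does not reflect actual results.

```python
import pandas as pd

# Made-up data: 'Unknown' stands for games without an ESRB rating
toy = pd.DataFrame({
    'rating': ['M', 'E', 'Unknown', 'Unknown', 'T'],
    'na_sales': [5.0, 3.0, 1.0, 1.0, 0.0],
    'eu_sales': [4.0, 3.0, 1.0, 1.0, 1.0],
    'jp_sales': [0.5, 0.5, 4.0, 4.0, 1.0],
})

region_cols = ['na_sales', 'eu_sales', 'jp_sales']
rating_totals = toy.groupby('rating')[region_cols].sum()

# Percentage of each region's sales that falls under each rating
rating_share = rating_totals.div(rating_totals.sum(axis=0), axis=1) * 100
unknown_share = rating_share.loc['Unknown']
print(unknown_share)
```

A high `'Unknown'` share in a region means the rating chart for that region describes only a fraction of its sales.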


# Step 5: Hypothesis Tests

- Average user ratings of the Xbox One and PC platforms are the same.

- Average user ratings for the Action and Sports genres are different.

Set the *alpha* threshold value yourself.

Explain:

- How you formulated the null and alternative hypotheses

- What criteria you used to test the hypotheses and why


In [None]:
# Set the alpha threshold
alpha = 0.05


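# Hypothesis 1: Xbox One vs. PC User Ratings

Formulation of Hypotheses

Null Hypothesis (H₀): The average user ratings for Xbox One and PC games are equal.

Alternative Hypothesis (H₁): The average user ratings for Xbox One and PC games are different.

The null hypothesis is the statement of no difference, and the alternative is two-sided because there is no prior expectation about which platform scores higher.

Why t-test?

We’re comparing the means of two independent groups (Xbox One vs. PC) based on user ratings, so a two-sample (independent) t-test is appropriate. The test is run as Welch's variant (`equal_var=False`), which does not assume the two groups have equal variances.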
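
One way to sanity-check the choice of Welch's variant is to compare the group variances first. The sketch below runs `scipy.stats.levene` and then Welch's t-test on synthetic scores. The two groups are randomly generated for illustration and are not the Xbox One or PC ratings.

```python
import numpy as np
from scipy import stats

# Synthetic groups with the same mean but different spread (illustrative only)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=6.5, scale=1.0, size=200)
group_b = rng.normal(loc=6.5, scale=2.0, size=150)

# Levene's test: the null hypothesis is that both groups have equal variance
levene_stat, levene_p = stats.levene(group_a, group_b)

# Welch's t-test does not rely on the equal-variance assumption
t_stat, p_val = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"Levene p-value: {levene_p:.4f}")
print(f"Welch t-statistic: {t_stat:.4f}, p-value: {p_val:.4f}")
```

A small Levene p-value suggests unequal variances, in which case Welch's test is the safer default. It also performs well when variances happen to be equal.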



In [None]:
from scipy import stats

# Filter and clean user scores for Xbox One and PC
xbox_scores = df_relevant[(df_relevant['platform'] == 'XOne') & (df_relevant['user_score'].notnull())]['user_score']
pc_scores = df_relevant[(df_relevant['platform'] == 'PC') & (df_relevant['user_score'].notnull())]['user_score']

# Perform independent t-test
t_stat1, p_val1 = stats.ttest_ind(xbox_scores, pc_scores, equal_var=False)  # Welch’s t-test

print(f"T-statistic: {t_stat1:.4f}, p-value: {p_val1:.4f}")

# Interpret result
if p_val1 < alpha:
    print("We reject the null hypothesis: Average user ratings for Xbox One and PC are significantly different.")
else:
    print("We fail to reject the null hypothesis: No significant difference in user ratings between Xbox One and PC.")


# Hypothesis 2: Action vs. Sports Genre User Ratings

Formulation of Hypotheses

Null Hypothesis (H₀): The average user ratings for Action and Sports genres are equal.

Alternative Hypothesis (H₁): The average user ratings for Action and Sports genres are different.

Why t-test?

Again, we’re comparing means between two independent groups (Action vs. Sports), so the independent two-sample t-test is appropriate, run as Welch's variant (`equal_var=False`).

In [None]:
# Filter and clean user scores for Action and Sports genres
action_scores = df_relevant[(df_relevant['genre'] == 'Action') & (df_relevant['user_score'].notnull())]['user_score']
sports_scores = df_relevant[(df_relevant['genre'] == 'Sports') & (df_relevant['user_score'].notnull())]['user_score']

# Perform independent t-test
t_stat2, p_val2 = stats.ttest_ind(action_scores, sports_scores, equal_var=False)

print(f"T-statistic: {t_stat2:.4f}, p-value: {p_val2:.4f}")

# Interpret result
if p_val2 < alpha:
    print("We reject the null hypothesis: There is a significant difference in user ratings between Action and Sports genres.")
else:
    print("We fail to reject the null hypothesis: No significant difference in user ratings between Action and Sports genres.")


# Step 6. Write a general conclusion
