## Import Libraries

In [30]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px

## Data Loading and Inspection

In [31]:
df = pd.read_csv('vgsales.csv')
df.head()

,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,1,Wii Sports,Wii,2006.0,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
1,2,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
2,3,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82
3,4,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.75,11.01,3.28,2.96,33.0
4,5,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37


The dataset contains video game sales records. Each row is one game entry, with its rank, name, platform, year of release, genre, and publisher, followed by sales figures for North America, Europe, Japan, and other regions, plus the global total. Sales figures are expressed in millions.

In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16598 entries, 0 to 16597
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Rank          16598 non-null  int64  
 1   Name          16598 non-null  object 
 2   Platform      16598 non-null  object 
 3   Year          16327 non-null  float64
 4   Genre         16598 non-null  object 
 5   Publisher     16540 non-null  object 
 6   NA_Sales      16598 non-null  float64
 7   EU_Sales      16598 non-null  float64
 8   JP_Sales      16598 non-null  float64
 9   Other_Sales   16598 non-null  float64
 10  Global_Sales  16598 non-null  float64
dtypes: float64(6), int64(1), object(4)
memory usage: 1.4+ MB


## Data Cleaning and Transformation

In [33]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16598 entries, 0 to 16597
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Rank          16598 non-null  int64  
 1   Name          16598 non-null  object 
 2   Platform      16598 non-null  object 
 3   Year          16327 non-null  float64
 4   Genre         16598 non-null  object 
 5   Publisher     16540 non-null  object 
 6   NA_Sales      16598 non-null  float64
 7   EU_Sales      16598 non-null  float64
 8   JP_Sales      16598 non-null  float64
 9   Other_Sales   16598 non-null  float64
 10  Global_Sales  16598 non-null  float64
dtypes: float64(6), int64(1), object(4)
memory usage: 1.4+ MB


In [34]:
df.isnull().sum()

Rank              0
Name              0
Platform          0
Year            271
Genre             0
Publisher        58
NA_Sales          0
EU_Sales          0
JP_Sales          0
Other_Sales       0
Global_Sales      0
dtype: int64

In [35]:
df['Publisher'] = df['Publisher'].fillna('Unknown')  # Assign back; inplace fillna on a selected column is deprecated in recent pandas
df.dropna(subset=['Year'], inplace=True)

In [36]:
df['Year'] = df['Year'].astype(int)

In [37]:
df.isnull().sum()

Rank            0
Name            0
Platform        0
Year            0
Genre           0
Publisher       0
NA_Sales        0
EU_Sales        0
JP_Sales        0
Other_Sales     0
Global_Sales    0
dtype: int64

The cleaning steps were, in order: inspect column types and non-null counts with `df.info()`; count missing values per column (271 in 'Year', 58 in 'Publisher'); replace missing 'Publisher' values with 'Unknown'; drop the 271 rows with no 'Year'; convert 'Year' from float to integer, which only works once the NaN rows are gone; and recheck that no missing values remain. The cleaned DataFrame has 16,327 rows and no missing values.
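The same steps can be wrapped in a small function that returns a cleaned copy instead of mutating `df` in place. This is only a sketch: `clean_sales` and the three-row `sample` frame are hypothetical names and made-up data used for illustration, not part of the notebook.

```python
import pandas as pd


def clean_sales(raw: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of the sales table (hypothetical helper).

    Mirrors the notebook's steps: fill missing publishers, drop rows
    without a release year, then cast Year to integer.
    """
    return (
        raw.assign(Publisher=raw["Publisher"].fillna("Unknown"))
        .dropna(subset=["Year"])      # Year must be present before the int cast
        .astype({"Year": int})
        .reset_index(drop=True)
    )


# Made-up rows, only to exercise the function
sample = pd.DataFrame(
    {
        "Name": ["A", "B", "C"],
        "Year": [2006.0, None, 1985.0],
        "Publisher": ["Nintendo", "Sega", None],
    }
)
cleaned_sample = clean_sales(sample)
print(cleaned_sample)
```

Because the function works on a copy, the original frame keeps its missing values, which makes it easy to rerun the cell or compare before and after.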

## Data Analysis and Visualization

In [38]:
print(df.describe())

               Rank          Year      NA_Sales      EU_Sales      JP_Sales  \
count  16327.000000  16327.000000  16327.000000  16327.000000  16327.000000   
mean    8292.868194   2006.406443      0.265415      0.147554      0.078661   
std     4792.669778      5.828981      0.821591      0.508766      0.311557   
min        1.000000   1980.000000      0.000000      0.000000      0.000000   
25%     4136.500000   2003.000000      0.000000      0.000000      0.000000   
50%     8295.000000   2007.000000      0.080000      0.020000      0.000000   
75%    12441.500000   2010.000000      0.240000      0.110000      0.040000   
max    16600.000000   2020.000000     41.490000     29.020000     10.220000   

        Other_Sales  Global_Sales  
count  16327.000000  16327.000000  
mean       0.048325      0.540232  
std        0.189885      1.565732  
min        0.000000      0.010000  
25%        0.000000      0.060000  
50%        0.010000      0.170000  
75%        0.040000      0.480000  


From the descriptive statistics provided, we can derive several insights about the video game sales dataset:

1. **Average Sales**:
   - The mean global sales per game are approximately 0.54 million copies.
   - The mean sales per game in North America, Europe, and Japan are approximately 0.27 million, 0.15 million, and 0.08 million, respectively.

2. **Sales Distribution**:
   - The standard deviation of global sales (1.57 million) is almost three times the mean (0.54 million), so sales vary widely from game to game.
   - The 75th percentile of global sales is 0.48 million, which is below the mean, and the median is only 0.17 million. Most games sell little, and a few very large sellers pull the mean up.

3. **Regional Sales Trends**:
   - North America has the highest average sales of any region, approximately 0.27 million per game, or roughly half of the global mean.
   - Europe is second at approximately 0.15 million per game, a little over half the North American figure.
   - Japan averages approximately 0.08 million per game, the lowest of the three named regions. The residual 'Other' category is lower still, at approximately 0.05 million.
   
4. **Year of Release**:
   - The dataset spans from 1980 to 2020, indicating a wide range of release years for the games included.
   - The median release year is 2007, and the middle half of the games were released between 2003 and 2010, so the dataset is concentrated in the 2000s.

5. **Global Sales Distribution**:
   - The distribution of global sales is right-skewed, with a large number of games having relatively low sales, while a few games achieve exceptionally high sales figures.
   - The maximum global sales figure recorded is 82.74 million (Wii Sports, the rank 1 entry shown in `df.head()`), more than 150 times the mean.

These statistics summarize how video game sales are distributed and give a baseline for the genre, region, platform, and publisher breakdowns that follow.
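The right skew described above can be checked with a few numbers rather than read off the `describe()` table. The sketch below uses a hypothetical helper, `skew_summary`, and a made-up `toy_sales` series; on the real data one would pass `df['Global_Sales']` instead.

```python
import pandas as pd


def skew_summary(sales: pd.Series) -> dict:
    """Summarize how lopsided a sales distribution is (hypothetical helper)."""
    mean = sales.mean()
    return {
        "mean": float(mean),
        "median": float(sales.median()),
        # Fraction of games that sell less than the average game
        "share_below_mean": float((sales < mean).mean()),
        # Sample skewness; positive values indicate a long right tail
        "skewness": float(sales.skew()),
    }


# Made-up values: five small sellers and one blockbuster
toy_sales = pd.Series([0.01, 0.05, 0.10, 0.20, 0.40, 5.00])
summary = skew_summary(toy_sales)
print(summary)
```

A mean well above the median, a large share of games below the mean, and a positive skewness all point the same way: a few blockbusters dominate the totals.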


In [39]:
# Calculate the count of each genre in the DataFrame
genre_counts = df['Genre'].value_counts()
# Shade each bar by its count using Plotly's built-in 'Blues' colorscale
# (the fixed list px.colors.sequential.Blues has only nine colors, which may be fewer than the number of genres)
# Create a bar chart data object with genre counts, colors, and text
bar_data = go.Bar(
    x=genre_counts.index,  # X-axis values (genres)
    y=genre_counts.values,  # Y-axis values (number of games per genre)
    marker=dict(color=genre_counts.values, colorscale='Blues'),  # Darker bars for larger counts
    text=genre_counts.values,   # Text to be displayed on the bars (number of games)
    textposition='auto'  # Automatically position the text on the bars
)
# Define layout settings for the bar chart
layout = go.Layout(
    title='Number of Games per Genre',  # Title of the chart
    xaxis=dict(title='Genre'),  # X-axis label
    yaxis=dict(title='Number of Games')  # Y-axis label
)
# Create a Figure object to combine the data and layout
fig = go.Figure(data=bar_data, layout=layout)
# Display the figure (bar chart) in the notebook
fig.show()

In [40]:
# Group the DataFrame by 'Genre' and calculate the total global sales for each genre
genre_sales = df.groupby('Genre')['Global_Sales'].sum().reset_index()
# Define the color scheme for the pie chart
colors = px.colors.sequential.Blues  # Sequential color scheme from Plotly
# Create a pie chart using Plotly Express, specifying values, names, title, and colors
fig = px.pie(
    genre_sales,  # DataFrame containing data to be plotted
    values='Global_Sales',  # Values to be represented as sector sizes (global sales)
    names='Genre',  # Names of the sectors (genres)
    title='Sales by Genre',  # Title of the pie chart
    color='Genre',  # Color each sector based on genre
    color_discrete_sequence=colors  # Assign colors from the defined color scheme
)
# Display the pie chart in the notebook
fig.show()

Comparing how many games each genre contains with how much each genre sells shows where release volume and sales diverge. Key observations from the two charts:

1. **Genre Analysis:**
   - **Popular Genres:** Action and Sports are the most common genres, with 3253 and 2304 games released respectively. This indicates their popularity among developers and may reflect consumer demand for these types of games.
   - **Sales Performance:** Shooter games account for 11.63% of total sales. That is below the 14.84% share of Sports, so Shooter does not lead on total sales, but it is a large share for a genre with fewer releases than Action or Sports. This suggests that a genre with fewer releases can achieve higher sales per game.

2. **Sales Distribution by Genre:**
   - **Top Performing Genres:** Sports games account for 14.84% of total sales, the largest of the shares quoted here. This is a substantial market share and points to the lasting appeal of Sports games among consumers.
   - **Role-Playing and Shooter Games:** Role-Playing and Shooter genres also sell strongly, with 10.47% and 11.63% of total sales respectively. Together with Sports, these three genres account for roughly 37% of total sales.

3. **Strategic Implications:**
   - **Targeted Marketing:** Publishers and developers can use the genre breakdown to tailor their marketing efforts and target specific audience segments. For instance, they can allocate more resources towards promoting Shooter games, given the genre's large sales share relative to its number of releases.
   - **Diversification:** While certain genres like Sports and Shooter perform well, there is still value in diversifying game portfolios to appeal to a broader audience. Exploring niche genres or innovative gameplay mechanics can help capture new market segments and mitigate risks associated with genre saturation.
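The comparison between release counts and sales shares becomes explicit when both are computed in one table. The following is a minimal sketch, assuming the column names 'Genre' and 'Global_Sales' from the notebook; `genre_summary` and the `toy_games` rows are hypothetical and made up for illustration.

```python
import pandas as pd


def genre_summary(games: pd.DataFrame) -> pd.DataFrame:
    """Titles, total sales, sales per title and sales share per genre (hypothetical helper)."""
    summary = games.groupby("Genre")["Global_Sales"].agg(
        titles="count", total_sales="sum"
    )
    summary["sales_per_title"] = summary["total_sales"] / summary["titles"]
    summary["share_pct"] = 100 * summary["total_sales"] / summary["total_sales"].sum()
    # Genres that sell the most per released game come first
    return summary.sort_values("sales_per_title", ascending=False)


# Made-up rows: many low-selling Action games, few high-selling Shooter games
toy_games = pd.DataFrame(
    {
        "Genre": ["Action"] * 4 + ["Shooter"] * 2,
        "Global_Sales": [1.0, 1.0, 1.0, 1.0, 3.0, 3.0],
    }
)
toy_summary = genre_summary(toy_games)
print(toy_summary)
```

On the real data, `genre_summary(df)` would show directly which genre has the largest share and which has the highest sales per title, instead of inferring both from separate charts.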


In [41]:
# Group the DataFrame by 'Year' and calculate the total sales for each region (NA, EU, JP, Other)
na_sales = df.groupby('Year')['NA_Sales'].sum()  # Total sales in North America
eu_sales = df.groupby('Year')['EU_Sales'].sum()  # Total sales in Europe
jp_sales = df.groupby('Year')['JP_Sales'].sum()  # Total sales in Japan
other_sales = df.groupby('Year')['Other_Sales'].sum()  # Total sales in other regions
# Create a new Figure object
fig = go.Figure()
# Add traces for each region's sales over the years
fig.add_trace(go.Scatter(x=na_sales.index, y=na_sales.values, mode='lines+markers', name='NA Sales', line=dict(color='blue')))  # Trace for NA sales
fig.add_trace(go.Scatter(x=eu_sales.index, y=eu_sales.values, mode='lines+markers', name='EU Sales', line=dict(color='red')))  # Trace for EU sales
fig.add_trace(go.Scatter(x=jp_sales.index, y=jp_sales.values, mode='lines+markers', name='JP Sales', line=dict(color='green')))  # Trace for JP sales
fig.add_trace(go.Scatter(x=other_sales.index, y=other_sales.values, mode='lines+markers', name='Other Sales', line=dict(color='purple')))  # Trace for other region sales
# Update layout settings for the plot
fig.update_layout(title='Sales per Year by Region', xaxis_title='Year', yaxis_title='Sales (millions)')
# Display the plot in the notebook
fig.show()

In [42]:
# Calculate the count of games released each year and sort the values by year
game_count_per_year = df['Year'].value_counts().sort_index()
# Prepare data for the histogram plot (Bar chart)
histogram_data = go.Bar(
    x=game_count_per_year.index.astype(str),  # X-axis values (years), converted to string
    y=game_count_per_year.values  # Y-axis values (number of games released each year)
)
# Define layout settings for the histogram plot
layout = go.Layout(
    title='Distribution of Games Released Each Year',  # Title of the plot
    xaxis=dict(title='Year'),  # Label for the x-axis
    yaxis=dict(title='Number of Games Released')  # Label for the y-axis
)
# Create a Figure object to combine the data and layout
fig = go.Figure(data=histogram_data, layout=layout)
# Display the histogram plot in the notebook
fig.show()

1. **Distribution of Games Released Over Time:** The number of games released per year grows from 1980 through the 2000s and then falls. The quartiles of 'Year' (2003, 2007, and 2010) show that three quarters of all games in the dataset were released by 2010, even though the data run to 2020. The growth reflects the expanding gaming market and rising interest from developers and consumers.

2. **Peak Years:** 2002 and 2003 saw a significant surge in game releases, with 829 and 775 games respectively. They are not the overall peak, however: at least half of the 16,327 games fall between 2003 and 2010, an average of more than 1,000 per year, so the busiest years lie later in that decade. The surge likely coincided with technological advancements, such as more powerful gaming consoles and the growing popularity of PC gaming, which enabled developers to release a larger volume of games.

3. **Market Saturation:** Annual release counts fluctuate rather than rising steadily, which suggests periods of market saturation or adjustment. For instance, there was a notable decrease in game releases in 2016 compared to the preceding years, indicating potential shifts in industry dynamics or development trends.

4. **Sales Performance by Year and Region:** Analyzing sales performance alongside the distribution of games released each year provides insights into market trends and consumer behavior. For example, while 2008 saw a high number of game releases, it also generated substantial sales figures across multiple regions, indicating strong consumer demand and market penetration.

5. **Regional Disparities:** Sales differ substantially across regions, so regional preferences and market dynamics matter for game development and distribution strategies. For instance, Japan accounts for roughly 15% of global sales overall (a mean of about 0.08 million per game against a global mean of 0.54 million), which makes localization and content tailored to that market worthwhile.

6. **Emerging Markets:** The data from recent years, such as 2013 onwards, can be used to track current trends and potential growth opportunities. The dataset separates only North America, Europe, and Japan from an aggregate 'Other' category, so growth in newer markets is visible only through 'Other_Sales'.
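Regional disparities are easier to compare across years as shares than as absolute totals. The sketch below converts yearly regional totals into percentages of that year's combined regional sales. It assumes the four regional column names from the notebook; `regional_share_by_year` and the `toy_games` rows are hypothetical.

```python
import pandas as pd

REGION_COLUMNS = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]


def regional_share_by_year(games: pd.DataFrame) -> pd.DataFrame:
    """Percentage of each year's sales contributed by each region (hypothetical helper)."""
    yearly = games.groupby("Year")[REGION_COLUMNS].sum()
    # Divide each row by its own total; a year with zero sales would yield NaN
    return yearly.div(yearly.sum(axis=1), axis=0) * 100


# Made-up rows: two games in 2000, one Japan-only game in 2001
toy_games = pd.DataFrame(
    {
        "Year": [2000, 2000, 2001],
        "NA_Sales": [1.0, 1.0, 0.0],
        "EU_Sales": [0.5, 0.5, 0.0],
        "JP_Sales": [0.25, 0.25, 1.0],
        "Other_Sales": [0.25, 0.25, 0.0],
    }
)
shares = regional_share_by_year(toy_games)
print(shares)
```

The same overall shares can be approximated from the `describe()` means: 0.27 / 0.54 is about 49% for North America, 0.15 / 0.54 about 27% for Europe, and 0.08 / 0.54 about 15% for Japan.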

In [43]:
# Calculate the count of games released for each platform and select the top 10 platforms
top_platforms = df['Platform'].value_counts().sort_values(ascending=True).tail(10)
# Prepare data for the horizontal bar chart (Bar chart)
platform_data = go.Bar(
    y=top_platforms.index,  # Y-axis values (platform names)
    x=top_platforms.values,  # X-axis values (number of games released for each platform)
    name='Platform',  # Name of the data series
    orientation='h'  # Orientation of the bars (horizontal)
)
# Define layout settings for the horizontal bar chart
layout_platform = go.Layout(
    title='Top 10 Platforms by Number of Games Released',  # Title of the plot
    xaxis=dict(title='Number of Games Released'),  # Label for the x-axis
    yaxis=dict(title='Platform')  # Label for the y-axis
)
# Create a Figure object to combine the data and layout
fig_platform = go.Figure(data=platform_data, layout=layout_platform)
# Display the horizontal bar chart in the notebook
fig_platform.show()

In [44]:
# Group the DataFrame by 'Platform' and calculate the total global sales for each platform
platform_sales = df.groupby('Platform')['Global_Sales'].sum()
# Select the top 10 platforms with the highest total global sales
top_platforms_sales = platform_sales.sort_values(ascending=True).tail(10)
# Prepare data for the horizontal bar chart (Bar chart)
platform_data_sales = go.Bar(
    y=top_platforms_sales.index,  # Y-axis values (platform names)
    x=top_platforms_sales.values,  # X-axis values (total global sales for each platform)
    name='Platform',  # Name of the data series
    orientation='h'  # Orientation of the bars (horizontal)
)
# Define layout settings for the horizontal bar chart
layout_platform_sales = go.Layout(
    title='Top 10 Platforms by Sales',  # Title of the plot
    xaxis=dict(title='Global Sales'),  # Label for the x-axis
    yaxis=dict(title='Platform')  # Label for the y-axis
)
# Create a Figure object to combine the data and layout
fig_platform_sales = go.Figure(data=platform_data_sales, layout=layout_platform_sales)
# Display the horizontal bar chart in the notebook
fig_platform_sales.show()

The comparison between the top platforms by the number of games released and their respective sales figures provides valuable insights into platform popularity and market performance in the gaming industry.

1. **Platform Popularity vs. Sales Performance:** While platforms like Xbox (XB), Game Boy Advance (GBA), and PlayStation (PS) have a significant number of games released, platforms like PlayStation 4 (PS4) and PlayStation 3 (PS3) emerge as top performers in terms of sales. This indicates that while a platform may have a large library of games, its sales performance is ultimately influenced by factors such as hardware capabilities, exclusive titles, and market demand.

2. **Diversification of Platforms:** The presence of diverse platforms in the top rankings highlights the variety of gaming experiences available to consumers. From traditional consoles like PlayStation and Xbox to handheld devices like PSP and DS, players have access to a wide range of gaming options catering to different preferences and demographics.

3. **Market Dominance:** PlayStation platforms, including PS, PS3, PS2, and PSP, demonstrate strong market dominance both in terms of the number of games released and sales figures. This underscores the brand's enduring popularity and the loyalty of its consumer base, making it a formidable force in the gaming industry.

4. **Emerging Trends:** Platforms like PC and PS4 exhibit strong sales performance despite having a comparatively lower number of games released. This suggests a growing trend towards digital distribution and online gaming, as well as the influence of exclusive titles and hardware advancements in driving platform adoption and sales.

5. **Longevity and Legacy:** Platforms like Wii and DS, despite being older platforms, still rank among the top platforms by total sales. The chart shows cumulative sales over each platform's lifetime, so this reflects the scale of their past success rather than ongoing sales.
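The two platform charts can be merged into a single table that ranks each platform by library size and by sales, which makes mismatches between the two easy to spot. This is a sketch under the assumption that the columns are named 'Platform' and 'Global_Sales' as in the notebook; `platform_ranks` and the `toy_games` rows are hypothetical.

```python
import pandas as pd


def platform_ranks(games: pd.DataFrame) -> pd.DataFrame:
    """Rank platforms by number of titles and by total sales (hypothetical helper)."""
    table = games.groupby("Platform")["Global_Sales"].agg(
        titles="count", total_sales="sum"
    )
    table["rank_by_titles"] = (
        table["titles"].rank(ascending=False, method="min").astype(int)
    )
    table["rank_by_sales"] = (
        table["total_sales"].rank(ascending=False, method="min").astype(int)
    )
    # Positive shift: the platform ranks higher on sales than on library size
    table["rank_shift"] = table["rank_by_titles"] - table["rank_by_sales"]
    return table.sort_values("rank_by_sales")


# Made-up rows: X has the most titles, Y has one very large seller
toy_games = pd.DataFrame(
    {
        "Platform": ["X", "X", "X", "Y", "Z", "Z"],
        "Global_Sales": [1.0, 1.0, 1.0, 10.0, 0.5, 0.5],
    }
)
ranks = platform_ranks(toy_games)
print(ranks)
```

Platforms with a large positive `rank_shift` sell more than their library size alone would suggest; a large negative value indicates many releases with comparatively low sales.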

In [45]:
# Calculate the count of games released for each publisher and select the top 10 publishers
top_publishers = df['Publisher'].value_counts().sort_values(ascending=True).tail(10)
# Prepare data for the horizontal bar chart (Bar chart)
publisher_data = go.Bar(
    y=top_publishers.index,  # Y-axis values (publisher names)
    x=top_publishers.values,  # X-axis values (number of games released for each publisher)
    name='Publisher',  # Name of the data series
    orientation='h'  # Orientation of the bars (horizontal)
)
# Define layout settings for the horizontal bar chart
layout_publisher = go.Layout(
    title='Top 10 Publishers by Number of Games Released',  # Title of the plot
    xaxis=dict(title='Number of Games Released'),  # Label for the x-axis
    yaxis=dict(title='Publisher')  # Label for the y-axis
)
# Create a Figure object to combine the data and layout
fig_publisher = go.Figure(data=publisher_data, layout=layout_publisher)
# Display the horizontal bar chart in the notebook
fig_publisher.show()

In [46]:
# Group the DataFrame by 'Publisher' and calculate the total global sales for each publisher
publisher_sales = df.groupby('Publisher')['Global_Sales'].sum()
# Select the top 10 publishers with the highest total global sales
top_publishers_sales = publisher_sales.sort_values(ascending=True).tail(10)
# Prepare data for the horizontal bar chart (Bar chart)
publisher_data_sales = go.Bar(
    y=top_publishers_sales.index,  # Y-axis values (publisher names)
    x=top_publishers_sales.values,  # X-axis values (total global sales for each publisher)
    name='Publisher',  # Name of the data series
    orientation='h'  # Orientation of the bars (horizontal)
)
# Define layout settings for the horizontal bar chart
layout_publisher_sales = go.Layout(
    title='Top 10 Publishers by Sales',  # Title of the plot
    xaxis=dict(title='Global Sales'),  # Label for the x-axis
    yaxis=dict(title='Publisher')  # Label for the y-axis
)
# Create a Figure object to combine the data and layout
fig_publisher_sales = go.Figure(data=publisher_data_sales, layout=layout_publisher_sales)
# Display the horizontal bar chart in the notebook
fig_publisher_sales.show()

The comparison between the top publishers by the number of games released and their respective sales figures reveals interesting patterns and insights into the gaming industry.

1. **Publishing Volume vs. Sales Performance:** Electronic Arts leads in the number of games released, with 1339 titles, while Nintendo is the top performer in sales, with 1784.43 million copies sold globally. Releasing a large number of games contributes to market presence, but sales performance also depends on factors such as game quality, brand reputation, and marketing strategies.

2. **Consistency in Performance:** Publishers like Namco Bandai Games, Sega, and Konami Digital Entertainment appear in the top rankings for both the number of games released and sales figures. This suggests a consistent track record of delivering both quantity and quality in their game offerings, showcasing their strong position in the market.

3. **Diverse Strategies:** Sony Computer Entertainment and Ubisoft follow different strategies. Sony reaches 607.28 million in sales despite releasing fewer games than Ubisoft, which releases a higher volume (918 titles) but sells about 22% less (473.54 million). This highlights the importance of understanding market dynamics and tailoring strategies to maximize sales.

4. **Opportunities for Growth:** While certain publishers, such as Electronic Arts and Activision, have a significant presence in terms of the number of games released, there may be opportunities to enhance sales performance through strategic initiatives such as content innovation, partnership expansions, and targeted marketing campaigns.

5. **Market Leadership:** Nintendo leads on sales even though Electronic Arts releases more games, which underscores its strong position as a market leader. By consistently delivering high-quality titles and leveraging iconic intellectual properties, Nintendo has established itself as a powerhouse in the gaming industry, setting a benchmark for other publishers.
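Which publishers appear in both top rankings, and which in only one, can be computed directly instead of being read off the two charts. The sketch below assumes the 'Publisher' and 'Global_Sales' columns from the notebook; `top_n_overlap` and the `toy_games` rows are hypothetical, and ties at the cut-off are resolved by pandas' default ordering.

```python
import pandas as pd


def top_n_overlap(games: pd.DataFrame, n: int = 10) -> dict:
    """Compare the top-n publishers by title count and by total sales (hypothetical helper)."""
    by_titles = set(games["Publisher"].value_counts().head(n).index)
    by_sales = set(
        games.groupby("Publisher")["Global_Sales"].sum().nlargest(n).index
    )
    return {
        "both": sorted(by_titles & by_sales),
        "titles_only": sorted(by_titles - by_sales),
        "sales_only": sorted(by_sales - by_titles),
    }


# Made-up rows: P1 releases the most games, P3 has a single large seller
toy_games = pd.DataFrame(
    {
        "Publisher": ["P1", "P1", "P1", "P2", "P2", "P3"],
        "Global_Sales": [0.1, 0.1, 0.1, 5.0, 5.0, 8.0],
    }
)
overlap = top_n_overlap(toy_games, n=2)
print(overlap)
```

On the real data, `top_n_overlap(df)` would list the publishers that combine volume with sales under "both", which is the group the consistency point above refers to.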

## Conclusion

The analysis of various facets of the gaming industry yields valuable insights into its evolution, market dynamics, and strategic considerations. From examining the distribution of games released over time to analyzing sales performance by genre and region, several key takeaways emerge:

1. **Industry Evolution:** The gaming industry has grown and changed significantly over the years. The number of games released annually rose through the 2000s and then declined in the most recent years covered by the dataset. The growth reflects technological advancements, changing consumer preferences, and the expanding global market for gaming.

2. **Genre Trends:** Certain genres, such as Action, Sports, and Shooter, are perennial favorites among developers and consumers alike. Action and Sports games dominate in sheer numbers, while Shooter games hold a large sales share for a smaller number of releases, which suggests higher sales per game and shows that release volume alone does not determine sales.

3. **Regional Dynamics:** Regional disparities in sales figures underscore the need for localization and tailored marketing strategies to cater to diverse audiences across different regions. Japan consistently contributes a notable portion of sales, emphasizing the significance of understanding local preferences and cultural nuances.

4. **Market Saturation and Opportunities:** Fluctuations in annual game releases suggest periods of market saturation or adjustment, prompting stakeholders to innovate and explore new market segments. Emerging markets present untapped opportunities for growth, necessitating proactive strategies to capitalize on evolving consumer demands and market dynamics.

5. **Strategic Implications:** The insights gleaned from data analysis enable stakeholders to make informed decisions regarding game development, publishing, and marketing. Targeted marketing efforts, diversification of game portfolios, and strategic partnerships can help maximize market share and profitability in an increasingly competitive landscape.

In conclusion, the gaming industry continues to evolve and thrive, driven by innovation, technological advancements, and shifting consumer preferences. By leveraging data-driven insights and strategic foresight, stakeholders can navigate challenges, capitalize on opportunities, and position themselves for sustained success in this dynamic and vibrant industry.