# VIDEO GAMES ANALYSIS <img src="video_games.jpg">

# OVERVIEW

This analysis aims to explore video game sales data with a focus on identifying trends and insights. The study will include the following key objectives:

Global Sales Analysis: Examine the top 100 most-sold games globally to identify sales patterns and performance.<p/>
Genre and Platform Analysis: Analyze video game sales based on genres and platforms to understand preferences and trends.<p/>
Regional Genre Preferences: Investigate game genres across different regions to identify variations in regional preferences.<p/>
Game Title Insights: Generate a word cloud based on game titles to highlight popular themes or keywords.<p/>
Release Year and Publisher Analysis: Analyze the release years of the top 1000 most-sold games and their associated publishers to identify historical trends and key contributors.<p/>
Descriptive Insights: Provide additional information about games, publishers, and platforms for context and understanding.<p/>
This comprehensive analysis will offer valuable insights into the video game industry, helping to understand the factors driving sales and popularity.<p/>

The dataset's fields and data types are:

Rank - Ranking of overall sales, integer

Name - The game's name, object

Platform - Platform of the game's release (e.g. PC, PS4), object

Year - Year of the game's release, float

Genre - Genre of the game, object

Publisher - Publisher of the game, object

NA_Sales - Sales in North America (in millions), float

EU_Sales - Sales in Europe (in millions), float

JP_Sales - Sales in Japan (in millions), float

Other_Sales - Sales in the rest of the world (in millions), float

Global_Sales - Total worldwide sales, float

### Dataset
Number of rows: 16598 <p/>
Number of columns: 11 <p/>

### Import necessary libraries

In [1]:
import pandas as pd
import numpy as np
import dtale
import matplotlib.pyplot as plt
import seaborn as sns

### Load data

In [2]:
vg=pd.read_csv("vgsales.csv")

### Data Understanding

In [3]:
vg.head()

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,1,Wii Sports,Wii,2006.0,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
1,2,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
2,3,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82
3,4,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.75,11.01,3.28,2.96,33.0
4,5,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37


In [4]:
vg.sample(5)

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
12171,12173,Dreamer Series: Pop Star,DS,2009.0,Misc,Tivola,0.04,0.02,0.0,0.01,0.07
10088,10090,Queen's Blade: Spiral Chaos,PSP,2009.0,Role-Playing,Namco Bandai Games,0.0,0.0,0.11,0.0,0.11
9193,9195,Syberia,DS,2008.0,Action,Mindscape,0.1,0.02,0.0,0.01,0.14
6917,6919,Angry Birds,PC,2011.0,Puzzle,Focus Home Interactive,0.0,0.18,0.0,0.05,0.24
10158,10160,18 Wheels of Steel: Extreme Trucker 2,PC,2011.0,Racing,Rondomedia,0.08,0.02,0.0,0.01,0.11


In [5]:
# Exploratory data analysis 
# dtale.show(vg, open_browser=True)

In [6]:
# number of columns and rows
vg.shape

(16598, 11)

### Data Cleaning

Issue 1: Year and Publisher contain missing values

In [7]:
vg[["Year", "Publisher"]].isna().sum()

Year         271
Publisher     58
dtype: int64

In [8]:
vg["Year"].isna().sum()

271

Plan: fill the missing values in Year using forward fill

In [9]:

# Series.ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas
vg["Year"] = vg["Year"].ffill()
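Forward fill copies the year from the row directly above. In a table sorted by sales rank, that neighbouring row is an unrelated game, so the filled year is a guess rather than a recovered value. The sketch below uses three made-up rows (the names `toy`, `filled` and `dropped` are illustrative only) to show what forward fill does and one possible alternative, dropping the rows with an unknown year.

```python
import numpy as np
import pandas as pd

# Toy rows in sales-rank order; the middle game has no recorded year.
toy = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C"],
    "Year": [2006.0, np.nan, 1996.0],
})

# Forward fill: Game B inherits 2006 from the unrelated row above it.
filled = toy["Year"].ffill()

# Alternative: drop rows whose year is unknown instead of guessing.
dropped = toy.dropna(subset=["Year"])

print(filled.tolist())   # [2006.0, 2006.0, 1996.0]
print(len(dropped))      # 2
```

Which option is preferable depends on whether later steps rely on Year; for year-based trends, dropping or flagging the rows may be the safer choice.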

In [10]:
# to have an idea of the value count for each publisher
vg["Publisher"].value_counts()

Publisher
Electronic Arts                 1351
Activision                       975
Namco Bandai Games               932
Ubisoft                          921
Konami Digital Entertainment     832
                                ... 
Warp                               1
New                                1
Elite                              1
Evolution Games                    1
UIG Entertainment                  1
Name: count, Length: 578, dtype: int64

In [11]:
# get the rows where Publisher is missing
missing_publisher = vg[vg["Publisher"].isnull()]

In [12]:
# missing_publisher.to_csv("missing_publisher")

In [13]:
# drop the remaining rows with missing values (the 58 rows without a Publisher)
vg.dropna(inplace=True)

Test: check the result of handling the missing values

In [14]:
vg.head()

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,1,Wii Sports,Wii,2006.0,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
1,2,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
2,3,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82
3,4,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.75,11.01,3.28,2.96,33.0
4,5,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37


In [15]:
# test: confirm that there are no missing values left in Year
vg["Year"].isna().sum()

0

In [16]:
vg.isna().sum()

Rank            0
Name            0
Platform        0
Year            0
Genre           0
Publisher       0
NA_Sales        0
EU_Sales        0
JP_Sales        0
Other_Sales     0
Global_Sales    0
dtype: int64

Issue 2: Year is stored as a float (e.g. 2006.0) and should be a string


In [17]:
vg.dtypes

Rank              int64
Name             object
Platform         object
Year            float64
Genre            object
Publisher        object
NA_Sales        float64
EU_Sales        float64
JP_Sales        float64
Other_Sales     float64
Global_Sales    float64
dtype: object

In [18]:
# plan: convert Year to a string without the trailing ".0"
def cleandate(x):
    # Leave missing values unchanged
    if x == 'nan' or pd.isna(x):
        return x
    # Remove '.0' and return the year as a string
    return str(int(float(x)))
    

In [19]:
vg["Year"]=vg["Year"].apply(cleandate)
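As a possible alternative to applying `cleandate` row by row, the same conversion can be sketched with vectorised dtype casts. This is only a sketch on a made-up Series (`years`, `years_int` and `years_str` are illustrative names): the nullable `Int64` dtype removes the `.0` while keeping the column numeric, and a further cast gives the string form.

```python
import numpy as np
import pandas as pd

# Toy Year column with one missing value.
years = pd.Series([2006.0, 1985.0, np.nan], name="Year")

# Nullable integer dtype: drops the ".0" but keeps the column numeric.
years_int = years.astype("Int64")

# String form, comparable to what cleandate returns for non-missing values.
years_str = years_int.astype("string")

print(years_int.tolist())
print(years_str.tolist())
```

Keeping the integer version makes later numeric filters on Year straightforward, while the string version is convenient for labels.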

In [20]:
# add new column to confirm if global sales is the sum of all sales
vg["test_sum"]=vg["NA_Sales"]+vg["EU_Sales"]+vg["JP_Sales"]+vg["Other_Sales"]

In [21]:
# confirmation: Global_Sales matches the sum of the regional sales to within rounding (e.g. 35.82 vs 35.83)
vg.head()

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,test_sum
0,1,Wii Sports,Wii,2006,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74,82.74
1,2,Super Mario Bros.,NES,1985,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24,40.24
2,3,Mario Kart Wii,Wii,2008,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82,35.83
3,4,Wii Sports Resort,Wii,2009,Sports,Nintendo,15.75,11.01,3.28,2.96,33.0,33.0
4,5,Pokemon Red/Pokemon Blue,GB,1996,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37,31.38


In [22]:
# compare the value between the rounded global sales and test sum
vg["check"]= vg["Global_Sales"].round(1)==vg["test_sum"].round(1)

In [23]:
vg.check

0        True
1        True
2        True
3        True
4        True
         ... 
16593    True
16594    True
16595    True
16596    True
16597    True
Name: check, Length: 16540, dtype: bool

In [24]:
vg[vg["check"] == False]

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,test_sum,check
32,33,Pokemon X/Pokemon Y,3DS,2013,Role-Playing,Nintendo,5.17,4.05,4.34,0.79,14.35,14.35,False
47,48,Gran Turismo 4,PS2,2004,Racing,Sony Computer Entertainment,3.01,0.01,1.10,7.53,11.66,11.65,False
85,86,Mario & Sonic at the Olympic Games,Wii,2007,Sports,Sega,2.58,3.90,0.66,0.91,8.06,8.05,False
154,155,Destiny,PS4,2014,Shooter,Activision,2.49,2.05,0.16,0.96,5.65,5.66,False
176,177,Assassin's Creed II,X360,2009,Action,Ubisoft,3.10,1.56,0.08,0.51,5.27,5.25,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
13002,13004,Ferrari F355 Challenge,PS2,2002,Racing,Sony Computer Entertainment,0.03,0.02,0.00,0.01,0.05,0.06,False
13010,13012,Shifters,PS2,2002,Adventure,3DO,0.03,0.02,0.00,0.01,0.05,0.06,False
13042,13044,Downforce,PS2,2002,Racing,Avalon Interactive,0.03,0.02,0.00,0.01,0.05,0.06,False
13043,13045,Bejeweled Twist,PC,2008,Puzzle,PopCap Games,0.01,0.04,0.00,0.01,0.05,0.06,False


Issue 3: Global_Sales is not always the exact sum of the regional sales, as confirmed above

Description: in most cases the totals differ only in the second decimal place, which was probably caused by rounding

Solution: keep the newly calculated test_sum and delete the Global_Sales column
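Comparing values rounded to one decimal can flag rows that differ only by floating-point noise or by a single hundredth. A tolerance-based comparison is one way to separate rounding noise from larger gaps. The sketch below reuses three value pairs from the mismatch table above; the 0.011 tolerance is an assumed threshold chosen for illustration, not something defined by the dataset.

```python
import numpy as np
import pandas as pd

# Three (Global_Sales, test_sum) pairs taken from the mismatch table.
toy = pd.DataFrame({
    "Global_Sales": [8.06, 5.27, 0.05],
    "test_sum":     [8.05, 5.25, 0.06],
})

# Absolute gap between the reported total and the recomputed total.
toy["gap"] = (toy["Global_Sales"] - toy["test_sum"]).abs()

# Treat gaps of about one hundredth (0.01 million) as rounding noise.
# atol=0.011 is an assumed threshold; rtol=0 makes the test purely absolute.
toy["close"] = np.isclose(toy["Global_Sales"], toy["test_sum"], rtol=0, atol=0.011)

print(toy)
```

With this threshold the 8.06 vs 8.05 and 0.05 vs 0.06 rows count as rounding differences, while 5.27 vs 5.25 remains flagged.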

In [25]:
vg.drop(["Global_Sales", "check"], axis =1, inplace = True)

In [26]:
# Rename test_sum to Global_Sum and Name to Video_game
vg.rename({"test_sum":"Global_Sum", "Name":"Video_game"}, axis=1, inplace = True)

In [27]:
# no-op: test_sum was already renamed in the previous cell
vg.rename({"test_sum": "Global_Sum"}, axis=1, inplace=True)

Test: check that the columns were properly renamed

In [28]:
vg.columns

Index(['Rank', 'Video_game', 'Platform', 'Year', 'Genre', 'Publisher',
       'NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sum'],
      dtype='object')

In [29]:
# check if columns were properly deleted
vg.columns

Index(['Rank', 'Video_game', 'Platform', 'Year', 'Genre', 'Publisher',
       'NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sum'],
      dtype='object')

In [30]:
vg.dtypes

Rank             int64
Video_game      object
Platform        object
Year            object
Genre           object
Publisher       object
NA_Sales       float64
EU_Sales       float64
JP_Sales       float64
Other_Sales    float64
Global_Sum     float64
dtype: object

In [31]:
vg.head()

Unnamed: 0,Rank,Video_game,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sum
0,1,Wii Sports,Wii,2006,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
1,2,Super Mario Bros.,NES,1985,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
2,3,Mario Kart Wii,Wii,2008,Racing,Nintendo,15.85,12.88,3.79,3.31,35.83
3,4,Wii Sports Resort,Wii,2009,Sports,Nintendo,15.75,11.01,3.28,2.96,33.0
4,5,Pokemon Red/Pokemon Blue,GB,1996,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.38


### Research questions
Which publisher is the best-performing publisher? <p/>
We are planning to shut down the least-performing genre. <p/>

In [32]:
# video games with global sum that is  above 50
GLOBAL_SUM_ABOVE_50 =vg[vg["Global_Sum"]>50]

In [33]:
GLOBAL_SUM_ABOVE_50.to_excel("GLOBAL SUM ABOVE 50.xlsx")

In [34]:
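# sum of sales for the Sports and Platform genres
# NOTE: the genre label in this dataset is "Sports"; "Sport" matches no rows, so the result below is the Platform total only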
vg[(vg["Genre"] == "Sport") | (vg["Genre"] == "Platform")]["Global_Sum"].sum()

830.54

In [35]:
# same filter using query (the "Sport" label again matches no rows)
vg.query("Genre == 'Sport' or Genre == 'Platform'")["Global_Sum"].sum()

830.54
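A filter on a label that does not exist in the column fails silently: it simply matches nothing. The sketch below, on four made-up rows, contrasts the misspelled label with the exact one using `isin`, and lists the distinct labels first as a guard. The names `toy`, `with_typo`, `with_labels` and `labels` are illustrative only.

```python
import pandas as pd

# Toy rows; the genre labels follow the spelling used in the dataset.
toy = pd.DataFrame({
    "Genre": ["Sports", "Platform", "Racing", "Sports"],
    "Global_Sum": [82.74, 40.24, 35.83, 33.0],
})

# Listing the labels first guards against a mismatched spelling.
labels = sorted(toy["Genre"].unique())

# "Sport" is not a value in the column, so only Platform rows are summed.
with_typo = toy[toy["Genre"].isin(["Sport", "Platform"])]["Global_Sum"].sum()

# Using the exact labels picks up both genres.
with_labels = toy[toy["Genre"].isin(["Sports", "Platform"])]["Global_Sum"].sum()

print(labels)
print(with_typo, with_labels)
```

`isin` also scales more comfortably than chained `|` conditions when more genres are added to the filter.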

In [36]:
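# total sales for the Racing genre in 2008
# NOTE: Year was converted to a string in In [19], so comparing it with the float 2008.0 matches no rows and the sum is 0.0; compare with "2008" instead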
vg[(vg['Genre'] == "Racing") & (vg['Year'] == 2008.0)] ["Global_Sum"].sum()

0.0
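The 0.0 result follows from a type mismatch: once Year holds strings, an equality test against a float is False for every row. A minimal sketch on made-up rows (all values and the names `toy`, `float_mask` and `str_mask` are illustrative) shows the difference.

```python
import pandas as pd

# Toy rows; Year is stored as strings in an object column, as after cleandate.
toy = pd.DataFrame({
    "Genre": ["Racing", "Racing", "Sports"],
    "Year": pd.Series(["2008", "2005", "2008"], dtype=object),
    "Global_Sum": [35.83, 23.43, 22.0],
})

# A string never equals a float, so this mask is all False and the sum is 0.0.
float_mask = (toy["Genre"] == "Racing") & (toy["Year"] == 2008.0)

# Comparing like with like finds the matching row.
str_mask = (toy["Genre"] == "Racing") & (toy["Year"] == "2008")

print(toy.loc[float_mask, "Global_Sum"].sum())  # 0.0
print(toy.loc[str_mask, "Global_Sum"].sum())    # 35.83
```

Checking `vg.dtypes` before writing a filter is a cheap way to catch this kind of mismatch.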

In [37]:
# top five video games in Europe by sales
vg.sort_values(by="EU_Sales", ascending = False)[["Video_game","EU_Sales"]].head(5)

Unnamed: 0,Video_game,EU_Sales
0,Wii Sports,29.02
2,Mario Kart Wii,12.88
3,Wii Sports Resort,11.01
10,Nintendogs,11.0
16,Grand Theft Auto V,9.27


### Grouping

In [38]:
# total sales by genre
vg.groupby("Genre")["Global_Sum"].sum()

Genre
Action          1749.30
Adventure        238.55
Fighting         445.73
Misc             801.54
Platform         830.54
Puzzle           244.41
Racing           731.76
Role-Playing     927.20
Shooter         1036.80
Simulation       391.67
Sports          1328.96
Strategy         174.57
Name: Global_Sum, dtype: float64

In [39]:
# average sales by genre
vg[["Genre", "EU_Sales", "Global_Sum"]].groupby("Genre").mean()


Genre,EU_Sales,Global_Sum
Action,0.158634,0.528649
Adventure,0.049984,0.186076
Fighting,0.118463,0.526868
Misc,0.124959,0.468189
Platform,0.227523,0.939525
Puzzle,0.087384,0.420671
Racing,0.19101,0.586346
Role-Playing,0.126548,0.623957
Shooter,0.23948,0.792661
Simulation,0.13117,0.453847


In [40]:
# count, mean and total sales by genre, globally and for three regions
vg.groupby("Genre").agg({"Global_Sum": ['count', 'mean','sum'],
                         "NA_Sales" : ['count', 'mean','sum'],
                         "JP_Sales" : ['count', 'mean','sum'],
                         "EU_Sales" : ['count', 'mean','sum']})

,Global_Sum,Global_Sum,Global_Sum,NA_Sales,NA_Sales,NA_Sales,JP_Sales,JP_Sales,JP_Sales,EU_Sales,EU_Sales,EU_Sales
Genre,count,mean,sum,count,mean,sum,count,mean,sum,count,mean,sum
Action,3309,0.528649,1749.3,3309,0.265198,877.54,3309,0.048199,159.49,3309,0.158634,524.92
Adventure,1282,0.186076,238.55,1282,0.082426,105.67,1282,0.040554,51.99,1282,0.049984,64.08
Fighting,846,0.526868,445.73,846,0.262317,221.92,846,0.103251,87.35,846,0.118463,100.22
Misc,1712,0.468189,801.54,1712,0.236373,404.67,1712,0.062921,107.72,1712,0.124959,213.93
Platform,884,0.939525,830.54,884,0.505713,447.05,884,0.14793,130.77,884,0.227523,201.13
Puzzle,581,0.420671,244.41,581,0.213046,123.78,581,0.09864,57.31,581,0.087384,50.77
Racing,1248,0.586346,731.76,1248,0.287997,359.42,1248,0.045425,56.69,1248,0.19101,238.38
Role-Playing,1486,0.623957,927.2,1486,0.220242,327.28,1486,0.237052,352.26,1486,0.126548,188.05
Shooter,1308,0.792661,1036.8,1308,0.445405,582.59,1308,0.029266,38.28,1308,0.23948,313.24
Simulation,863,0.453847,391.67,863,0.21241,183.31,863,0.073743,63.64,863,0.13117,113.2


In [41]:
# Get the top 5 publishers by Global_Sum
top_5_publishers = vg.groupby("Publisher")["Global_Sum"].sum().nlargest(5)

print(top_5_publishers)


Publisher
Nintendo                       1786.36
Electronic Arts                1110.15
Activision                      727.11
Sony Computer Entertainment     607.49
Ubisoft                         474.51
Name: Global_Sum, dtype: float64


In [42]:
# visualize the top 5 publishers by global sales
top_5_publishers.plot(kind="bar", figsize=(8, 4), ylabel="Global sales (millions)")
plt.show()

### PIVOT TABLES

In [43]:
# total global sales per platform and genre; "-" marks combinations with no games
vg_pivot = vg.pivot_table(index="Platform", columns="Genre", values="Global_Sum", aggfunc="sum", fill_value="-")

### THIS SHOWS THAT MONEY SHOULD NOT BE SPENT ON PLATFORMS THAT HAVE LOW GLOBAL SALES

In [44]:
### we can reference this sheet in our powerpoint while explaining what we are doing
vg_pivot.to_csv("sales_by_genre_by_platform.csv")
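Filling empty cells with `"-"` reads well in an exported sheet, but it turns the affected columns into mixed text and number columns, which blocks further arithmetic. A sketch of a numeric variant on four toy rows is shown here; `fill_value=0` and the names `toy`, `pivot` and `platform_totals` are choices made for this illustration only.

```python
import pandas as pd

# Toy rows using the notebook's column names.
toy = pd.DataFrame({
    "Platform": ["Wii", "Wii", "NES", "GB"],
    "Genre": ["Sports", "Racing", "Platform", "Role-Playing"],
    "Global_Sum": [82.74, 35.83, 40.24, 31.38],
})

# fill_value=0 keeps every cell numeric.
pivot = toy.pivot_table(index="Platform", columns="Genre",
                        values="Global_Sum", aggfunc="sum", fill_value=0)

# With numeric cells, totals per platform are a one-liner.
platform_totals = pivot.sum(axis=1).sort_values(ascending=False)

print(pivot)
print(platform_totals)
```

If the dash is wanted only for presentation, it can be applied at export time while the numeric table is kept for analysis.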

### CORRELATION ANALYSIS
If there is no relationship between two variables, for example marketing cost and revenue, then increasing marketing cost should not be expected to increase revenue.<p/>
Correlation is not causation: two variables can move together without one causing the other.<p/>
With a positive correlation, B tends to increase as A increases; with a negative correlation, B tends to decrease as A increases.

In [45]:
vg[['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']].corr()

Unnamed: 0,NA_Sales,EU_Sales,JP_Sales,Other_Sales
NA_Sales,1.0,0.767672,0.449864,0.634651
EU_Sales,0.767672,1.0,0.435658,0.726326
JP_Sales,0.449864,0.435658,1.0,0.29015
Other_Sales,0.634651,0.726326,0.29015,1.0
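The coefficients in the table are Pearson correlations, which is the default method of `DataFrame.corr()`. To make the number less of a black box, the sketch below computes it by hand for two regions using the five top-ranked rows shown earlier and checks it against pandas. On five rows the value will differ from the full-dataset figure; `toy`, `r_manual` and `r_pandas` are illustrative names.

```python
import numpy as np
import pandas as pd

# NA and JP sales of the five top-ranked games from vg.head().
toy = pd.DataFrame({
    "NA_Sales": [41.49, 29.08, 15.85, 15.75, 11.27],
    "JP_Sales": [3.77, 6.81, 3.79, 3.28, 10.22],
})

# Pearson r: sum of products of deviations, divided by the
# square root of the product of the summed squared deviations.
x = toy["NA_Sales"] - toy["NA_Sales"].mean()
y = toy["JP_Sales"] - toy["JP_Sales"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

# pandas computes the same coefficient.
r_pandas = toy["NA_Sales"].corr(toy["JP_Sales"])

print(r_manual, r_pandas)
```

The coefficient always lies between -1 and 1, and it only measures linear association, which is why it cannot establish causation on its own.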


In [46]:
# we are planning to shut down the least-performing genre: the five genres with the lowest total sales
vg.groupby("Genre")["Global_Sum"].sum().nsmallest(5)

Genre
Strategy      174.57
Adventure     238.55
Puzzle        244.41
Simulation    391.67
Fighting      445.73
Name: Global_Sum, dtype: float64

In [47]:
# video game with highest sales in each region
highest_sales= {
                "NA_Sales": vg.loc[vg["NA_Sales"].idxmax(), ["Video_game", "NA_Sales"]],
                "JP_Sales": vg.loc[vg["JP_Sales"].idxmax(), ["Video_game", "JP_Sales"]],
                "EU_Sales":  vg.loc[vg["EU_Sales"].idxmax(), ["Video_game", "EU_Sales"]],
                "Other_Sales": vg.loc[vg["Other_Sales"].idxmax(), ["Video_game", "Other_Sales"]]
}
               
               

In [48]:
highest_sales

{'NA_Sales': Video_game    Wii Sports
 NA_Sales           41.49
 Name: 0, dtype: object,
 'JP_Sales': Video_game    Pokemon Red/Pokemon Blue
 JP_Sales                         10.22
 Name: 4, dtype: object,
 'EU_Sales': Video_game    Wii Sports
 EU_Sales           29.02
 Name: 0, dtype: object,
 'Other_Sales': Video_game     Grand Theft Auto: San Andreas
 Other_Sales                            10.57
 Name: 17, dtype: object}
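The per-region lookup can also be written without repeating the `idxmax` call for each column. The sketch below is one possible compact form on three toy rows; the sales figures not shown in the output above are placeholders, and `regions`, `by_title` and `best` are illustrative names.

```python
import pandas as pd

# Toy rows; values not shown in the notebook output are placeholders.
toy = pd.DataFrame({
    "Video_game": ["Wii Sports", "Pokemon Red/Pokemon Blue",
                   "Grand Theft Auto: San Andreas"],
    "NA_Sales": [41.49, 11.27, 9.0],
    "JP_Sales": [3.77, 10.22, 0.5],
    "EU_Sales": [29.02, 8.89, 0.5],
    "Other_Sales": [8.46, 1.0, 10.57],
})

regions = ["NA_Sales", "JP_Sales", "EU_Sales", "Other_Sales"]

# Index by title so idxmax returns the game name for each region column.
by_title = toy.set_index("Video_game")[regions]
best = pd.DataFrame({"Video_game": by_title.idxmax(), "Sales": by_title.max()})

print(best)
```

The result is a small table with one row per region, which is easier to export or plot than a dictionary of Series. Note that `idxmax` returns only the first title when several rows tie for the maximum.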

In [49]:
vg.head()

Unnamed: 0,Rank,Video_game,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sum
0,1,Wii Sports,Wii,2006,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
1,2,Super Mario Bros.,NES,1985,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
2,3,Mario Kart Wii,Wii,2008,Racing,Nintendo,15.85,12.88,3.79,3.31,35.83
3,4,Wii Sports Resort,Wii,2009,Sports,Nintendo,15.75,11.01,3.28,2.96,33.0
4,5,Pokemon Red/Pokemon Blue,GB,1996,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.38


In [50]:
# which region contributes the most to global sales
vg[["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]].sum().idxmax()

'NA_Sales'

In [51]:
# sum each region sales
total_NA_Sales = vg["NA_Sales"].sum()
total_JP_Sales = vg["JP_Sales"].sum()
total_Other_Sales = vg["Other_Sales"].sum()
total_EU_Sales = vg["EU_Sales"].sum()

# SUM THE GLOBAL SALES
total_Global_Sales = vg["Global_Sum"].sum()

# calculate the percentage contribution
percentage_global_sales = {
    "NA_Sales": f"{(total_NA_Sales / total_Global_Sales) * 100:.0f}%",
    "JP_Sales": f"{(total_JP_Sales / total_Global_Sales) * 100:.0f}%",
    "Other_Sales": f"{(total_Other_Sales / total_Global_Sales) * 100:.0f}%",
    "EU_Sales": f"{(total_EU_Sales / total_Global_Sales) * 100:.0f}%"
}

In [52]:
percentage_global_sales

{'NA_Sales': '49%', 'JP_Sales': '14%', 'Other_Sales': '9%', 'EU_Sales': '27%'}
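The four whole-number percentages above add up to 99% rather than 100% because each share is rounded separately. A vectorised version of the same calculation, keeping one decimal place, might look like the sketch below. It runs on the two top-ranked rows only, so the shares differ from the full-dataset figures; `regions`, `region_totals` and `share_pct` are illustrative names.

```python
import pandas as pd

# Regional sales of the two top-ranked games from vg.head().
toy = pd.DataFrame({
    "NA_Sales": [41.49, 29.08],
    "EU_Sales": [29.02, 3.58],
    "JP_Sales": [3.77, 6.81],
    "Other_Sales": [8.46, 0.77],
})

regions = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]

# Column totals, then each region's share of the combined total.
region_totals = toy[regions].sum()
share_pct = (region_totals / region_totals.sum() * 100).round(1)

print(share_pct)
```

Because the result is a pandas Series, it can be sorted, plotted or written out with `to_csv` directly.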

In [53]:
# a dict has no to_csv method; wrap it in a Series first
# pd.Series(percentage_global_sales).to_csv("percentage_global_sales.csv")