# Video Game Sales

In this project, we explore what people like when it comes to video games, using sales figures as a proxy for popularity.

![wallpapercave.com](https://wallpapercave.com/wp/RoU39PE.jpg)

## Downloading the Dataset

> - You can find the dataset on this page: https://www.kaggle.com/gregorut/videogamesales
> - The dataset was originally released by [VGChartz](https://www.vgchartz.com/)

In [None]:
!pip install opendatasets --upgrade --quiet

Let's begin by downloading the data, and listing the files within the dataset.

In [None]:
# Change this
dataset_url = 'https://www.kaggle.com/gregorut/videogamesales'

In [None]:
import opendatasets as od
od.download(dataset_url)

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: hosseinheydarid98
Your Kaggle Key: ··········
Downloading videogamesales.zip to ./videogamesales


100%|██████████| 381k/381k [00:00<00:00, 23.1MB/s]

The dataset has been downloaded and extracted.

In [None]:
# Change this
data_dir = './videogamesales'

In [None]:
import os
os.listdir(data_dir)

['vgsales.csv']

In [None]:
project_name = "video-game-sales-analysis"

## Data Preparation and Cleaning

Abbreviations and units used in the column names:
> - NA means North America
> - EU means Europe
> - JP means Japan
> - Sales are in millions of copies sold
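
As a quick sanity check on these columns, the regional figures should add up to the global figure. The sketch below uses three rows taken from the top of the dataset (Wii Sports, Super Mario Bros., Mario Kart Wii) and the original CSV column names; the tolerance of 0.02 is an assumption to absorb rounding to two decimals.

```python
import pandas as pd

# Three rows copied from the top of the dataset; column names follow the CSV header.
sample = pd.DataFrame({
    "Name": ["Wii Sports", "Super Mario Bros.", "Mario Kart Wii"],
    "NA_Sales": [41.49, 29.08, 15.85],
    "EU_Sales": [29.02, 3.58, 12.88],
    "JP_Sales": [3.77, 6.81, 3.79],
    "Other_Sales": [8.46, 0.77, 3.31],
    "Global_Sales": [82.74, 40.24, 35.82],
})

region_cols = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]
regional_total = sample[region_cols].sum(axis=1)

# Assumed tolerance: each value is rounded to two decimals, so small gaps are expected.
matches = (regional_total - sample["Global_Sales"]).abs() <= 0.02
print(matches.all())
```

The same comparison could be run on the full DataFrame once it is loaded.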


In [None]:
import pandas as pd

In [None]:
sales_df = pd.read_csv('./videogamesales/vgsales.csv')

In [None]:
sales_df

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,1,Wii Sports,Wii,2006.0,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
1,2,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
2,3,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82
3,4,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.75,11.01,3.28,2.96,33.00
4,5,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.00,31.37
...,...,...,...,...,...,...,...,...,...,...,...
16593,16596,Woody Woodpecker in Crazy Castle 5,GBA,2002.0,Platform,Kemco,0.01,0.00,0.00,0.00,0.01
16594,16597,Men in Black II: Alien Escape,GC,2003.0,Shooter,Infogrames,0.01,0.00,0.00,0.00,0.01
16595,16598,SCORE International Baja 1000: The Official Game,PS2,2008.0,Racing,Activision,0.00,0.00,0.00,0.00,0.01
16596,16599,Know How 2,DS,2010.0,Puzzle,7G//AMES,0.00,0.01,0.00,0.00,0.01


As can be seen, the column names start with **capital letters**, which makes attribute-style access inconvenient, so we `rename` them to lowercase.

In [None]:
sales_df = sales_df.rename(columns = {'Rank':'rank','Name':'name','Platform':'platform','Year':'year', 'Genre':'genre','Publisher':'publisher','NA_Sales':'na_sales',
                                      'EU_Sales':'eu_sales','JP_Sales':'jp_sales','Other_Sales':'other_sales', 'Global_Sales':'global_sales'})
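
Because every new name is just the lowercase form of the old one, a shorter alternative is to lowercase all column labels at once. This is a minimal sketch on a hypothetical empty frame (`demo_df`) that reuses the original column names; it is not the approach used in this notebook.

```python
import pandas as pd

# Hypothetical empty frame that reuses the original column names.
demo_df = pd.DataFrame(columns=["Rank", "Name", "Platform", "Year", "Genre", "Publisher",
                                "NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales", "Global_Sales"])

# Lowercase every column label in one step.
demo_df.columns = demo_df.columns.str.lower()
print(list(demo_df.columns))
```

The explicit mapping above remains the safer choice when only some columns should change or when the new names are not simple lowercase forms.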

In [None]:
sales_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16598 entries, 0 to 16597
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   rank          16598 non-null  int64  
 1   name          16598 non-null  object 
 2   platform      16598 non-null  object 
 3   year          16327 non-null  float64
 4   genre         16598 non-null  object 
 5   publisher     16540 non-null  object 
 6   na_sales      16598 non-null  float64
 7   eu_sales      16598 non-null  float64
 8   jp_sales      16598 non-null  float64
 9   other_sales   16598 non-null  float64
 10  global_sales  16598 non-null  float64
dtypes: float64(6), int64(1), object(4)
memory usage: 1.4+ MB


In [None]:
sales_df.describe()

Unnamed: 0,rank,year,na_sales,eu_sales,jp_sales,other_sales,global_sales
count,16598.0,16327.0,16598.0,16598.0,16598.0,16598.0,16598.0
mean,8300.605254,2006.406443,0.264667,0.146652,0.077782,0.048063,0.537441
std,4791.853933,5.828981,0.816683,0.505351,0.309291,0.188588,1.555028
min,1.0,1980.0,0.0,0.0,0.0,0.0,0.01
25%,4151.25,2003.0,0.0,0.0,0.0,0.0,0.06
50%,8300.5,2007.0,0.08,0.02,0.0,0.01,0.17
75%,12449.75,2010.0,0.24,0.11,0.04,0.04,0.47
max,16600.0,2020.0,41.49,29.02,10.22,10.57,82.74


In [None]:
sales_df.isna().sum()

rank              0
name              0
platform          0
year            271
genre             0
publisher        58
na_sales          0
eu_sales          0
jp_sales          0
other_sales       0
global_sales      0
dtype: int64
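
The notebook proceeds without treating these missing values. One possible way to handle them, sketched on a hypothetical miniature frame (`demo_df`, invented rows), is to drop rows with an unknown year and label unknown publishers. Whether dropping or keeping such rows is appropriate depends on the question being asked.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of sales_df with one missing year and one missing publisher.
demo_df = pd.DataFrame({
    "name": ["Game A", "Game B", "Game C"],
    "year": [2006.0, np.nan, 2010.0],
    "publisher": ["Nintendo", "Activision", None],
    "global_sales": [1.20, 0.50, 0.03],
})

# Keep only rows with a known year, then store the year as an integer.
with_year = demo_df.dropna(subset=["year"]).copy()
with_year["year"] = with_year["year"].astype(int)

# Label missing publishers instead of dropping those rows.
with_year["publisher"] = with_year["publisher"].fillna("Unknown")
print(with_year)
```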

In [None]:
sales_df.sort_values('rank').head(20)

Unnamed: 0,rank,name,platform,year,genre,publisher,na_sales,eu_sales,jp_sales,other_sales,global_sales
0,1,Wii Sports,Wii,2006.0,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
1,2,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
2,3,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82
3,4,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.75,11.01,3.28,2.96,33.0
4,5,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37
5,6,Tetris,GB,1989.0,Puzzle,Nintendo,23.2,2.26,4.22,0.58,30.26
6,7,New Super Mario Bros.,DS,2006.0,Platform,Nintendo,11.38,9.23,6.5,2.9,30.01
7,8,Wii Play,Wii,2006.0,Misc,Nintendo,14.03,9.2,2.93,2.85,29.02
8,9,New Super Mario Bros. Wii,Wii,2009.0,Platform,Nintendo,14.59,7.06,4.7,2.26,28.62
9,10,Duck Hunt,NES,1984.0,Shooter,Nintendo,26.93,0.63,0.28,0.47,28.31


## Exploratory Analysis and Visualization
Let's get to know the data a little bit more.



Let's begin by importing `matplotlib.pyplot`, `seaborn` and `plotly.express`.


In [None]:
import plotly.express as px
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

sns.set_style('darkgrid')
matplotlib.style.use('dark_background')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

If we plot a histogram of global sales for the 1,000 top-ranked games, we can see that most of them sold around 5 million copies.

In [None]:
top_1000 = sales_df.sort_values('rank').head(1000)

In [None]:
fig = px.histogram(top_1000,
                  x='global_sales',
                  marginal='box',
                  title='Distribution of Sales',
                  template='ggplot2')
fig.update_layout(bargap=0.1, xaxis_title="Global Sales", yaxis_title="Count")
fig.show()

Among the 1,000 top-ranked games, most sold around five million copies.

Now let's see which **platforms** had the most games released for them.

In [None]:
platform_pop = sales_df.groupby('platform').size().reset_index(name='counts').sort_values(by='counts',ascending = False)

In [None]:
fig = px.pie(platform_pop,
             values='counts',
             names='platform',
             title='Game Production based on Platform')
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.update_layout(margin = dict(l=50,r=50,b=50,t=50,pad=4) ,uniformtext_minsize=10 , uniformtext_mode='hide')
fig.show()

The DS, PS2, and PS3 were the three platforms with the most games, accounting for 13%, 13%, and 8.01% of all titles, respectively.
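
The percentages in the pie chart can also be computed directly. The following sketch uses a hypothetical `platforms` Series with invented values; in the notebook the input would be `sales_df['platform']`.

```python
import pandas as pd

# Hypothetical platform column; the real one comes from sales_df['platform'].
platforms = pd.Series(["DS", "DS", "PS2", "PS2", "PS3", "Wii", "Wii", "Wii"], name="platform")

# normalize=True returns fractions of the total instead of raw counts.
share = platforms.value_counts(normalize=True).mul(100).round(2)
print(share)
```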

Game sales for different platforms vary; let's look at it in detail.

In [None]:
# numeric_only=True keeps text columns such as name and publisher out of the sum
platform_sale = sales_df.groupby('platform').sum(numeric_only=True).reset_index().sort_values(by='global_sales', ascending=False)

In [None]:
fig = px.bar(platform_sale.rename(columns={'na_sales':'North America Sales','eu_sales':'Europe Sales','jp_sales':'Japan Sales','other_sales':'Other Sales'}),
             x='platform', y=['North America Sales','Europe Sales','Japan Sales','Other Sales'],
             labels={'value':'Sales (in million)','platform':'Platforms'})
fig.update_layout(legend_title_text='Regions')
fig.show()

By total sales, the PS2, X360, and PS3 were the most successful platforms, and North America was their primary market.

Now let's see which **genres** had the most games released.

In [None]:
genre_pop = sales_df.groupby('genre').size().reset_index(name='counts').sort_values(by='counts',ascending = False)

In [None]:
fig = px.pie(genre_pop,
             values='counts',
             names='genre',
             title='Game Production based on Genre')
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.update_layout(margin = dict(l=50,r=50,b=50,t=50,pad=4) ,uniformtext_minsize=10 , uniformtext_mode='hide')
fig.show()

Action games (20%) and Sports games (14.1%) were the two most common genres.

Game sales for different genres vary; let's look at it in detail.

In [None]:
# numeric_only=True keeps text columns such as name and publisher out of the sum
genre_sale = sales_df.groupby('genre').sum(numeric_only=True).reset_index().sort_values(by='global_sales', ascending=False)

In [None]:
fig = px.bar(genre_sale.rename(columns={'na_sales':'North America Sales','eu_sales':'Europe Sales','jp_sales':'Japan Sales','other_sales':'Other Sales'}),
             x='genre', y=['North America Sales','Europe Sales','Japan Sales','Other Sales'],
             labels={'value':'Sales (in million)','genre':'Genres'})
# The legend lists the regional sales columns, so it is titled "Regions"
fig.update_layout(legend_title_text='Regions')
fig.show()

By total sales, Action, Sports, and Shooter were the most successful genres.

## Asking and Answering Questions

Now we want to delve deeper into the data and understand it more comprehensively.



#### Q1: Have video games become more popular in recent years?

In [None]:
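# Build the yearly totals first; the filter below needs year_sale to exist already
year_sale = sales_df.groupby('year').sum(numeric_only=True).reset_index().sort_values(by='year')
year_sale = year_sale[year_sale.year <= 2014]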

In [None]:
year_sale

Unnamed: 0,year,rank,na_sales,eu_sales,jp_sales,other_sales,global_sales
0,1980.0,29826,10.59,0.67,0.0,0.12,11.38
1,1981.0,190488,33.4,1.96,0.0,0.32,35.77
2,1982.0,149186,26.92,1.65,0.0,0.31,28.86
3,1983.0,56759,7.76,0.8,8.1,0.14,16.79
4,1984.0,22911,33.28,2.1,14.27,0.7,50.36
5,1985.0,55505,33.73,4.74,14.56,0.92,53.94
6,1986.0,35986,12.5,2.84,19.81,1.93,37.07
7,1987.0,54701,8.46,1.41,11.63,0.2,21.74
8,1988.0,37181,23.87,6.59,15.76,0.99,47.22
9,1989.0,40156,45.15,8.44,18.36,1.5,73.45


In [None]:
fig = px.line(year_sale.rename(columns = {'na_sales':'North America Sales','eu_sales':'Europe Sales','jp_sales':'Japan Sales','other_sales':'Other Sales'}),
                                             x='year', y=['North America Sales','Europe Sales','Japan Sales','Other Sales'], labels={'value':'Sales (in million)','year':'Years'},
              template = 'simple_white')
fig.update_layout(legend_title_text='Regions')
fig.show()

According to the graph, video game sales, and by that measure the popularity of video games, have grown over time.
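
One way to put a number on such a trend is to compare the first and last years and locate the best year. The sketch below is limited to the ten rows of `year_sale` displayed above (1980 to 1989), so it says nothing about later years; the `decade` frame is a hypothetical stand-in for the full table.

```python
import pandas as pd

# Global sales (in millions) for 1980-1989, copied from the year_sale output above.
decade = pd.DataFrame({
    "year": list(range(1980, 1990)),
    "global_sales": [11.38, 35.77, 28.86, 16.79, 50.36, 53.94, 37.07, 21.74, 47.22, 73.45],
})

# Row with the highest global sales in this window.
peak_row = decade.loc[decade["global_sales"].idxmax()]

# Ratio of the last year's sales to the first year's sales.
growth = decade["global_sales"].iloc[-1] / decade["global_sales"].iloc[0]
print(int(peak_row["year"]), round(growth, 2))
```

Applying the same two lines to the full `year_sale` frame would show whether the growth continues through 2014 or peaks earlier.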

#### Q2: How were the trends for the three most popular platforms between 2000 and 2015?

In [None]:
platform_over_year_sales = sales_df.groupby(['platform','year']).sum(numeric_only=True).reset_index().sort_values(by='year')
# Keep only the years 2000 to 2015 (both ends inclusive) with a single boolean mask
platform_over_year_sales = platform_over_year_sales[platform_over_year_sales.year.between(2000, 2015)]

In [None]:
top_3_platforms = list(platform_pop.platform[:3])

In [None]:
platform_over_year_sales = platform_over_year_sales[platform_over_year_sales.platform.isin(top_3_platforms)]

In [None]:
fig = px.line(platform_over_year_sales, x="year", y="global_sales", color="platform", line_shape="spline", render_mode="svg", labels={'global_sales':'Sales (in million)','year':'Years'},
              template = 'simple_white')
fig.update_layout(legend_title_text='Platforms')
fig.show()

The three platforms followed a similar pattern: each reached its peak about three years after its introduction and was out of date after about ten years.
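
The time from a platform's first year to its best year can be checked numerically rather than read off the chart. This is a sketch on a hypothetical frame (`demo`) shaped like `platform_over_year_sales`; the sales numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical yearly totals shaped like platform_over_year_sales; the numbers are invented.
demo = pd.DataFrame({
    "platform": ["PS2"] * 4 + ["DS"] * 4,
    "year": [2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007],
    "global_sales": [39.0, 166.0, 205.0, 184.0, 17.0, 131.0, 121.0, 149.0],
})

# idxmax returns the row label of each platform's best year.
peak_rows = demo.loc[demo.groupby("platform")["global_sales"].idxmax()]

# First year with recorded sales for each platform.
first_year = demo.groupby("platform")["year"].min()

# Both Series are indexed by platform, so the subtraction aligns per platform.
years_to_peak = peak_rows.set_index("platform")["year"] - first_year
print(years_to_peak)
```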

#### Q3: How were the trends for the three most popular genres between 2000 and 2015?

In [None]:
genre_over_year_sales = sales_df.groupby(['genre','year']).sum(numeric_only=True).reset_index().sort_values(by='year')
# Keep only the years 2000 to 2015 (both ends inclusive) with a single boolean mask
genre_over_year_sales = genre_over_year_sales[genre_over_year_sales.year.between(2000, 2015)]

In [None]:
top_3_genres = list(genre_pop.genre[:3])

In [None]:
genre_over_year_sales = genre_over_year_sales[genre_over_year_sales.genre.isin(top_3_genres)]

In [None]:
fig = px.line(genre_over_year_sales, x="year", y="global_sales", color="genre", line_shape="spline", render_mode="svg", labels={'global_sales':'Sales (in million)','year':'Years'},
              template = 'simple_white')
fig.update_layout(legend_title_text='Genre')
fig.show()

Action games remained popular throughout the period. Sales of Sports and Misc games rose from 2003, and their popularity had dropped by 2012.

#### Q4: Which publishers were more successful over these years?

In [None]:
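# Top 50 publishers by global sales; numeric_only=True keeps text columns out of the sum
publisher_sale = sales_df.groupby('publisher').sum(numeric_only=True).reset_index().sort_values(by='global_sales', ascending=False)[:50]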

In [None]:
fig = px.bar(publisher_sale.rename(columns = {'na_sales':'North America Sales','eu_sales':'Europe Sales','jp_sales':'Japan Sales','other_sales':'Other Sales'}),
                                             x='publisher', y=['North America Sales','Europe Sales','Japan Sales','Other Sales'], labels={'value':'Sales (in million)','publisher':'Publisher'},
              template = 'simple_white')
fig.update_layout(legend_title_text='Region')
fig.show()

The three leading publishers of video games were Nintendo, Electronic Arts, and Activision.

#### Q5: In each region, which platforms sold the most?

In [None]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go

In [None]:
# Aggregate once, then take the five best-selling platforms in each region
platform_totals = sales_df.groupby('platform').sum(numeric_only=True)
na = platform_totals.sort_values('na_sales', ascending=False).head(5).reset_index()[['platform','na_sales','global_sales']]
eu = platform_totals.sort_values('eu_sales', ascending=False).head(5).reset_index()[['platform','eu_sales','global_sales']]
jp = platform_totals.sort_values('jp_sales', ascending=False).head(5).reset_index()[['platform','jp_sales','global_sales']]
other = platform_totals.sort_values('other_sales', ascending=False).head(5).reset_index()[['platform','other_sales','global_sales']]

In [None]:
fig = make_subplots(rows=2, cols=2)
fig.add_trace(
    go.Bar(x= list(na.platform), y= list(na.na_sales), name="North America"),
    row=1, col=1
)

fig.add_trace(
    go.Bar(x= list(eu.platform), y= list(eu.eu_sales), name="Europe"),
    row=1, col=2
)

fig.add_trace(
    go.Bar(x= list(jp.platform), y= list(jp.jp_sales), name="Japan"),
    row=2, col=1
)

fig.add_trace(
    go.Bar(x= list(other.platform), y= list(other.other_sales), name="Other"),
    row=2, col=2
)

fig.update_layout(height=600, width=800)
fig.show()

Video game sales were highest for the X360 in North America, the PS3 in Europe, the DS in Japan, and the PS2 in other parts of the world.
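
The leading platform in each region can also be extracted without a chart. The sketch below runs on a hypothetical `totals` frame with invented per-platform sums; in the notebook, the equivalent input would be the per-platform sums of the regional sales columns.

```python
import pandas as pd

# Hypothetical per-platform totals; the values are invented for illustration.
totals = pd.DataFrame(
    {
        "na_sales": [600.0, 390.0, 385.0],
        "eu_sales": [280.0, 340.0, 190.0],
        "jp_sales": [12.0, 80.0, 175.0],
    },
    index=pd.Index(["X360", "PS3", "DS"], name="platform"),
)

# For each region column, idxmax returns the platform (index label) with the largest total.
top_platform = totals.idxmax()
print(top_platform)
```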

## Inferences and Conclusion

In conclusion, North America was the largest market for video games, judging by the sales figures for this region. Platform preferences vary between regions, but PlayStation and Xbox consoles were the prevalent choices. Finally, the most popular genre was Action, which might suggest that people enjoy a bit of violence deep down.