# Video Game Sales

## Introduction

<p>In this project, I am working with a video game sales dataset. The dataset consists of user and expert reviews, genres, platforms (e.g. Xbox or PlayStation), and historical game sales data collected from open sources. It also contains ESRB (Entertainment Software Rating Board) age ratings, which are based on a game's content. I will analyze this data to identify patterns that determine whether a game succeeds. The findings will be used to spot potential big winners and plan advertising campaigns.</p>

<p>Dataset column descriptions: name, platform, release year, genre, sales in USD millions (North America, Europe, Japan, and all other countries), critic score, user score, and ESRB rating.</p>

### Stages
My project consists of the following stages:
1. Introduction
2. Data Overview
3. Data Preprocessing
4. Feature Engineering
5. Data Analysis & Plotting
6. Regional User Profiles
7. Statistical Hypotheses Testing
8. Conclusion

## Data Overview

In [1]:
# load all the libraries
from scipy import stats as st
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import plotly.express as px

In [2]:
# load the data file
df = pd.read_csv('https://practicum-content.s3.us-west-1.amazonaws.com/datasets/moved_games.csv')

In [3]:
# general info about dataframe
df.info()
df.sample(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16715 entries, 0 to 16714
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Name             16713 non-null  object 
 1   Platform         16715 non-null  object 
 2   Year_of_Release  16446 non-null  float64
 3   Genre            16713 non-null  object 
 4   NA_sales         16715 non-null  float64
 5   EU_sales         16715 non-null  float64
 6   JP_sales         16715 non-null  float64
 7   Other_sales      16715 non-null  float64
 8   Critic_Score     8137 non-null   float64
 9   User_Score       10014 non-null  object 
 10  Rating           9949 non-null   object 
dtypes: float64(6), object(5)
memory usage: 1.4+ MB


Unnamed: 0,Name,Platform,Year_of_Release,Genre,NA_sales,EU_sales,JP_sales,Other_sales,Critic_Score,User_Score,Rating
5685,Chicken Blaster,Wii,2009.0,Shooter,0.29,0.0,0.0,0.02,,tbd,T
6599,Ice Age 2: The Meltdown,GBA,2006.0,Platform,0.18,0.07,0.0,0.0,,tbd,E
12444,Akai Katana Shin,X360,2011.0,Shooter,0.04,0.0,0.01,0.0,,,
9716,Chessmaster: The Art of Learning,PSP,2008.0,Misc,0.11,0.0,0.0,0.01,64.0,8.4,E
3757,007 Racing,PS,2000.0,Racing,0.3,0.2,0.0,0.03,51.0,4.6,T
7465,Wacky Races: Crash & Dash,DS,2008.0,Racing,0.19,0.0,0.0,0.01,,,
9431,Cabela's Dangerous Hunts 2009,X360,2008.0,Sports,0.12,0.0,0.0,0.01,42.0,tbd,T
15179,Toxic Grind,XB,2002.0,Sports,0.02,0.0,0.0,0.0,49.0,tbd,T
10878,Altered Beast: Guardian of the Realms,GBA,2002.0,Action,0.07,0.02,0.0,0.0,63.0,4.8,T
3084,LEGO Jurassic World,XOne,2015.0,Action,0.38,0.22,0.0,0.06,70.0,6.7,E10+


### Conclusion
The column names will need to be changed to lower-case to fit naming conventions. I will also need to change the data types for the year of release and user score columns. There are missing values in many of the columns. I will need to investigate these values and choose the appropriate method to deal with them. I will also need to double check for any duplicates in the data. 
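
The checks described above can be put into numbers. The sketch below is a minimal illustration on a small made-up frame (`toy` and its values are hypothetical, not rows from the games data): it reports the share of missing values per column and the number of fully duplicated rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the games data
toy = pd.DataFrame({
    'Name': ['a', 'b', None, 'd'],
    'Critic_Score': [76.0, np.nan, np.nan, 82.0],
    'User_Score': ['8', 'tbd', None, '8.3'],
})

# Share of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Number of rows that are exact copies of an earlier row
n_duplicates = int(toy.duplicated().sum())
print(n_duplicates)
```

Running the same two lines on `df` would show at a glance which columns need the most attention.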

## Data Preprocessing

In [4]:
# change column names to lowercase
df.columns = df.columns.str.lower()
# check changes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16715 entries, 0 to 16714
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   name             16713 non-null  object 
 1   platform         16715 non-null  object 
 2   year_of_release  16446 non-null  float64
 3   genre            16713 non-null  object 
 4   na_sales         16715 non-null  float64
 5   eu_sales         16715 non-null  float64
 6   jp_sales         16715 non-null  float64
 7   other_sales      16715 non-null  float64
 8   critic_score     8137 non-null   float64
 9   user_score       10014 non-null  object 
 10  rating           9949 non-null   object 
dtypes: float64(6), object(5)
memory usage: 1.4+ MB


In [5]:
# change values in columns to lowercase
df.name = df.name.str.lower()
df.genre = df.genre.str.lower()
df.platform = df.platform.str.lower()
df.rating = df.rating.str.lower()

In [6]:
# check for any duplicates in df
df.duplicated().sum()

0

In [7]:
# show every row whose title appears more than once
print(df[df.name.duplicated(keep=False)].sort_values(by='name'))
# count the repeated occurrences (every copy after the first)
print(df.name.duplicated().sum())

                             name platform  year_of_release     genre  \
3862         frozen: olaf's quest       ds           2013.0  platform   
3358         frozen: olaf's quest      3ds           2013.0  platform   
3120       007: quantum of solace      wii           2008.0    action   
1785       007: quantum of solace      ps3           2008.0    action   
1285       007: quantum of solace     x360           2008.0    action   
...                           ...      ...              ...       ...   
12439          zumba fitness core      wii           2012.0      misc   
7137   zumba fitness: world party      wii           2013.0      misc   
6878   zumba fitness: world party     xone           2013.0      misc   
659                           NaN      gen           1993.0       NaN   
14244                         NaN      gen           1993.0       NaN   

       na_sales  eu_sales  jp_sales  other_sales  critic_score user_score  \
3862       0.21      0.26      0.00         0.

**Notes**<br>
Almost half of the dataset consists of rows with duplicated titles in the name column. Each row has sales totals for that title on a different platform, so these are not complete duplicates and should be left alone. 
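
One way to confirm that repeated titles are separate platform releases rather than true duplicates is to test for duplicates on a combination of columns. This is a sketch on made-up rows (`toy` is hypothetical); the choice of `name`, `platform` and `year_of_release` as the identifying columns is an assumption.

```python
import pandas as pd

# Hypothetical rows: 'game a' on two platforms, 'game b' entered twice
toy = pd.DataFrame({
    'name': ['game a', 'game a', 'game b', 'game b'],
    'platform': ['ps3', 'x360', 'wii', 'wii'],
    'year_of_release': [2008.0, 2008.0, 2012.0, 2012.0],
})

# Rows that share a title with at least one other row
same_title = toy['name'].duplicated(keep=False)

# Rows that also share the platform and release year (likely real duplicates)
same_release = toy.duplicated(subset=['name', 'platform', 'year_of_release'], keep=False)

print(int(same_title.sum()))
print(toy[same_release])
```

Only rows flagged by the second check would be candidates for removal or merging.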

In [8]:
# investigate missing values in name column
df[df.name.isna()]

Unnamed: 0,name,platform,year_of_release,genre,na_sales,eu_sales,jp_sales,other_sales,critic_score,user_score,rating
659,,gen,1993.0,,1.78,0.53,0.0,0.08,,,
14244,,gen,1993.0,,0.0,0.0,0.03,0.0,,,


In [9]:
# fill missing titles with 'untitled'
df['name'] = df['name'].fillna('untitled')
# check changes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16715 entries, 0 to 16714
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   name             16715 non-null  object 
 1   platform         16715 non-null  object 
 2   year_of_release  16446 non-null  float64
 3   genre            16713 non-null  object 
 4   na_sales         16715 non-null  float64
 5   eu_sales         16715 non-null  float64
 6   jp_sales         16715 non-null  float64
 7   other_sales      16715 non-null  float64
 8   critic_score     8137 non-null   float64
 9   user_score       10014 non-null  object 
 10  rating           9949 non-null   object 
dtypes: float64(6), object(5)
memory usage: 1.4+ MB



**Notes**<br>
There were only 2 games with missing titles. Since this was such a small proportion of the total data, I renamed these games as untitled. These games could just be untitled games or there could have been issues in the collection of data. 

In [10]:
# check unique values for genre
print(df.genre.unique())
# investigate missing values for genre
df[df.genre.isna()]

['sports' 'platform' 'racing' 'role-playing' 'puzzle' 'misc' 'shooter'
 'simulation' 'action' 'fighting' 'adventure' 'strategy' nan]


Unnamed: 0,name,platform,year_of_release,genre,na_sales,eu_sales,jp_sales,other_sales,critic_score,user_score,rating
659,untitled,gen,1993.0,,1.78,0.53,0.0,0.08,,,
14244,untitled,gen,1993.0,,0.0,0.0,0.03,0.0,,,


**Notes**<br>
Only 2 games are missing a genre, and they are the same two rows that were missing a title. I left these values blank because they are such a small proportion of the dataset that they should not affect my analysis.

In [11]:
# investigate missing values in year_of_release column
df[df.year_of_release.isna()]

Unnamed: 0,name,platform,year_of_release,genre,na_sales,eu_sales,jp_sales,other_sales,critic_score,user_score,rating
183,madden nfl 2004,ps2,,sports,4.26,0.26,0.01,0.71,94.0,8.5,e
377,fifa soccer 2004,ps2,,sports,0.59,2.36,0.04,0.51,84.0,6.4,e
456,lego batman: the videogame,wii,,action,1.80,0.97,0.00,0.29,74.0,7.9,e10+
475,wwe smackdown vs. raw 2006,ps2,,fighting,1.57,1.02,0.00,0.41,,,
609,space invaders,2600,,shooter,2.36,0.14,0.00,0.03,,,
...,...,...,...,...,...,...,...,...,...,...,...
16373,pdc world championship darts 2008,psp,,sports,0.01,0.00,0.00,0.00,43.0,tbd,e10+
16405,freaky flyers,gc,,racing,0.01,0.00,0.00,0.00,69.0,6.5,t
16448,inversion,pc,,shooter,0.01,0.00,0.00,0.00,59.0,6.7,m
16458,hakuouki: shinsengumi kitan,ps3,,adventure,0.01,0.00,0.00,0.00,,,


In [12]:
# check all games released with the title space invaders
df[df.name=='space invaders']

Unnamed: 0,name,platform,year_of_release,genre,na_sales,eu_sales,jp_sales,other_sales,critic_score,user_score,rating
609,space invaders,2600,,shooter,2.36,0.14,0.0,0.03,,,
4264,space invaders,snes,1994.0,shooter,0.0,0.0,0.46,0.0,,,
8580,space invaders,n64,1999.0,shooter,0.13,0.03,0.0,0.0,,,
10383,space invaders,gba,2002.0,shooter,0.08,0.03,0.0,0.0,,,


In [13]:
# check all games released on the 2600 platform
df[df.platform=='2600']

Unnamed: 0,name,platform,year_of_release,genre,na_sales,eu_sales,jp_sales,other_sales,critic_score,user_score,rating
89,pac-man,2600,1982.0,puzzle,7.28,0.45,0.0,0.08,,,
240,pitfall!,2600,1981.0,platform,4.21,0.24,0.0,0.05,,,
262,asteroids,2600,1980.0,shooter,4.00,0.26,0.0,0.05,,,
546,missile command,2600,1980.0,shooter,2.56,0.17,0.0,0.03,,,
609,space invaders,2600,,shooter,2.36,0.14,0.0,0.03,,,
...,...,...,...,...,...,...,...,...,...,...,...
8741,klax,2600,1989.0,puzzle,0.14,0.01,0.0,0.00,,,
9095,krull,2600,1982.0,action,0.13,0.01,0.0,0.00,,,
9487,realsports volleyball,2600,1981.0,sports,0.12,0.01,0.0,0.00,,,
11747,super football,2600,1987.0,sports,0.07,0.00,0.0,0.00,,,


**Notes**<br>
Since there is no way to accurately fill the missing values in the release year column, I will drop the rows that are missing it. They make up less than 2% of the data, so dropping them should not have a significant impact on the final analysis.
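
The "less than 2%" figure follows directly from the counts reported by `df.info()` earlier in the notebook (16715 rows, 16446 non-null release years). A short worked check:

```python
total_rows = 16715       # RangeIndex size reported by df.info()
non_null_years = 16446   # non-null count of year_of_release

# Rows without a release year, as a count and as a share of the data
missing_years = total_rows - non_null_years
missing_share = missing_years / total_rows
print(missing_years, f"{missing_share:.2%}")
```

That is 269 rows, or roughly 1.6% of the dataset.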

In [14]:
# create series of release year average for each platform in the dataset
platform_year = df.groupby(['platform'])['year_of_release'].mean()
platform_year = round(platform_year)
platform_year = platform_year.astype(int)
platform_year

platform
2600    1982
3do     1995
3ds     2013
dc      2000
ds      2008
gb      1996
gba     2003
gc      2003
gen     1993
gg      1992
n64     1999
nes     1987
ng      1994
pc      2009
pcfx    1996
ps      1998
ps2     2005
ps3     2011
ps4     2015
psp     2009
psv     2014
sat     1996
scd     1994
snes    1994
tg16    1995
wii     2009
wiiu    2014
ws      2000
x360    2010
xb      2004
xone    2015
Name: year_of_release, dtype: int64

In [15]:
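# remove rows with missing values in year_of_release column
# .copy() makes the result an independent DataFrame, so later column assignments do not act on a slice
df = df[df.year_of_release.notna()].copy()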

In [16]:
# check unique values in critic score column
print(df.critic_score.unique())
# descriptive stats on critic scores
df.critic_score.describe()

[76. nan 82. 80. 89. 58. 87. 91. 61. 97. 95. 77. 88. 83. 94. 93. 85. 86.
 98. 96. 90. 84. 73. 74. 78. 92. 71. 72. 68. 62. 49. 67. 81. 66. 56. 79.
 70. 59. 64. 75. 60. 63. 69. 50. 25. 42. 44. 55. 48. 57. 29. 47. 65. 54.
 20. 53. 37. 38. 33. 52. 30. 32. 43. 45. 51. 40. 46. 39. 34. 41. 36. 31.
 27. 35. 26. 19. 28. 23. 24. 21. 17. 13.]


count    7983.000000
mean       68.994363
std        13.920060
min        13.000000
25%        60.000000
50%        71.000000
75%        79.000000
max        98.000000
Name: critic_score, dtype: float64

**Notes**<br>
So many rows are missing a critic score (roughly half of the dataset) that filling them in could skew the analysis and affect the statistical testing and modeling. I am keeping these rows because their other columns are still useful for my analysis.

In [17]:
# check unique values in user_score
print(df.user_score.unique())
df[df.user_score=='tbd']

['8' nan '8.3' '8.5' '6.6' '8.4' '8.6' '7.7' '6.3' '7.4' '8.2' '9' '7.9'
 '8.1' '8.7' '7.1' '3.4' '5.3' '4.8' '3.2' '8.9' '6.4' '7.8' '7.5' '2.6'
 '7.2' '9.2' '7' '7.3' '4.3' '7.6' '5.7' '5' '9.1' '6.5' 'tbd' '8.8' '6.9'
 '9.4' '6.8' '6.1' '6.7' '5.4' '4' '4.9' '4.5' '9.3' '6.2' '4.2' '6' '3.7'
 '4.1' '5.8' '5.6' '5.5' '4.4' '4.6' '5.9' '3.9' '3.1' '2.9' '5.2' '3.3'
 '4.7' '5.1' '3.5' '2.5' '1.9' '3' '2.7' '2.2' '2' '9.5' '2.1' '3.6' '2.8'
 '1.8' '3.8' '0' '1.6' '9.6' '2.4' '1.7' '1.1' '0.3' '1.5' '0.7' '1.2'
 '2.3' '0.5' '1.3' '0.2' '0.6' '1.4' '0.9' '1' '9.7']


Unnamed: 0,name,platform,year_of_release,genre,na_sales,eu_sales,jp_sales,other_sales,critic_score,user_score,rating
119,zumba fitness,wii,2010.0,sports,3.45,2.59,0.0,0.66,,tbd,e
301,namco museum: 50th anniversary,ps2,2005.0,misc,2.08,1.35,0.0,0.54,61.0,tbd,e10+
520,zumba fitness 2,wii,2011.0,sports,1.51,1.03,0.0,0.27,,tbd,t
645,udraw studio,wii,2010.0,misc,1.65,0.57,0.0,0.20,71.0,tbd,e
718,just dance kids,wii,2010.0,misc,1.52,0.54,0.0,0.18,,tbd,e
...,...,...,...,...,...,...,...,...,...,...,...
16695,planet monsters,gba,2001.0,action,0.01,0.00,0.0,0.00,67.0,tbd,e
16697,bust-a-move 3000,gc,2003.0,puzzle,0.01,0.00,0.0,0.00,53.0,tbd,e
16698,mega brain boost,ds,2008.0,puzzle,0.01,0.00,0.0,0.00,48.0,tbd,e
16704,plushees,ds,2008.0,simulation,0.01,0.00,0.0,0.00,,tbd,e


In [18]:
# change user_score column to numeric ('tbd' becomes NaN)
df.user_score = pd.to_numeric(df.user_score,errors='coerce')
# descriptive stats for user_score 
df.user_score.describe()

count    7463.000000
mean        7.126330
std         1.499447
min         0.000000
25%         6.400000
50%         7.500000
75%         8.200000
max         9.700000
Name: user_score, dtype: float64

**Notes**<br>
The games with 'tbd' listed as their user_score are probably games that have yet to receive a review. The titles with missing values could be in the same situation, or there could have been issues in the data collection process that led to incomplete data. Converting the column to numeric with errors='coerce' turned the 'tbd' entries into NaN, so they are now treated the same as the other missing scores. Because such a large proportion of the scores are missing, I did not fill them. I also did not drop these rows, as they still contain other data that will be useful for my analysis.
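
The effect of `errors='coerce'` is easy to see on a small example. The series below is made up for illustration (`scores` is hypothetical); counting the 'tbd' entries before converting keeps a record of how many there were, since afterwards they cannot be told apart from values that were missing all along.

```python
import pandas as pd

# Hypothetical user scores as they appear in the raw file
scores = pd.Series(['8', 'tbd', None, '6.6', 'tbd'])

# Count the 'tbd' placeholders before they are lost
n_tbd = int((scores == 'tbd').sum())

# Anything that cannot be parsed as a number becomes NaN
numeric = pd.to_numeric(scores, errors='coerce')

print(n_tbd)
print(numeric)
```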

In [19]:
# check unique values in ratings column
df.rating.unique()

array(['e', nan, 'm', 't', 'e10+', 'k-a', 'ao', 'ec', 'rp'], dtype=object)

In [20]:
# investigate missing values in ratings column
df[df.rating.isna()]

Unnamed: 0,name,platform,year_of_release,genre,na_sales,eu_sales,jp_sales,other_sales,critic_score,user_score,rating
1,super mario bros.,nes,1985.0,platform,29.08,3.58,6.81,0.77,,,
4,pokemon red/pokemon blue,gb,1996.0,role-playing,11.27,8.89,10.22,1.00,,,
5,tetris,gb,1989.0,puzzle,23.20,2.26,4.22,0.58,,,
9,duck hunt,nes,1984.0,shooter,26.93,0.63,0.28,0.47,,,
10,nintendogs,ds,2005.0,simulation,9.05,10.95,1.93,2.74,,,
...,...,...,...,...,...,...,...,...,...,...,...
16710,samurai warriors: sanada maru,ps3,2016.0,action,0.00,0.00,0.01,0.00,,,
16711,lma manager 2007,x360,2006.0,sports,0.00,0.01,0.00,0.00,,,
16712,haitaka no psychedelica,psv,2016.0,adventure,0.00,0.00,0.01,0.00,,,
16713,spirits & spells,gba,2003.0,platform,0.01,0.00,0.00,0.00,,,


**Notes**<br>
Since there is no accurate way to fill these missing values, I will leave these values as is. I also will not drop the rows missing this data as they contain lots of other data that could be beneficial to my analysis. 
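
Leaving the ratings as NaN has one side effect worth knowing: `groupby` drops NaN keys by default, so unrated games silently disappear from any per-rating summary. The sketch below uses made-up numbers (`toy` is hypothetical) and the `dropna=False` option, available in pandas 1.1 and later, to keep them visible.

```python
import numpy as np
import pandas as pd

# Hypothetical sales rows, two of them without a rating
toy = pd.DataFrame({
    'rating': ['e', np.nan, 'm', np.nan, 'e'],
    'na_sales': [1.0, 2.0, 0.5, 0.25, 0.25],
})

# Default behaviour: rows with a missing rating are left out
rated_only = toy.groupby('rating')['na_sales'].sum()

# dropna=False keeps a separate NaN group for the unrated games
by_rating = toy.groupby('rating', dropna=False)['na_sales'].sum()

print(rated_only)
print(by_rating)
```

Comparing the two totals shows how much of the sales volume belongs to games without a rating.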

### Conclusion

<p>All of the missing and duplicate values have been investigated. I dropped the rows that were missing the release year and filled the two missing titles with 'untitled'. I left all other missing values alone, as I was not able to fill them accurately. The rows missing a release year were less than 2% of the data, so dropping them should not have a large effect on my analysis. On the other hand, about 40% of the data is missing the rating, so I will have to be mindful of this when analyzing the different rating categories.</p>

## Feature Engineering

In [21]:
# create total sales column
df.insert(loc=8,column='tot_sales',value=df.na_sales+df.eu_sales+df.jp_sales+df.other_sales)
df.head()


Unnamed: 0,name,platform,year_of_release,genre,na_sales,eu_sales,jp_sales,other_sales,tot_sales,critic_score,user_score,rating
0,wii sports,wii,2006.0,sports,41.36,28.96,3.77,8.45,82.54,76.0,8.0,e
1,super mario bros.,nes,1985.0,platform,29.08,3.58,6.81,0.77,40.24,,,
2,mario kart wii,wii,2008.0,racing,15.68,12.76,3.79,3.29,35.52,82.0,8.3,e
3,wii sports resort,wii,2009.0,sports,15.61,10.93,3.28,2.95,32.77,80.0,8.0,e
4,pokemon red/pokemon blue,gb,1996.0,role-playing,11.27,8.89,10.22,1.0,31.38,,,


In [22]:
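# create function to create age-category for games based on year_of_release
def age_category(x):
    # age in years relative to 2016, the reference year used in this project
    age = 2016-x
    # games released in 2016 have age 0 and belong in the youngest category
    if 0 <= age <=5: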
        return '0-5'
    elif 5 < age <=10:
        return '6-10'
    elif 10 < age <= 15:
        return '11-15'
    elif 15 < age <= 20:
        return '16-20'
    elif 20 < age <= 25:
        return '21-25'
    else:
        return '>25'

In [23]:
# create age_category 
df['age_category'] = df.year_of_release.apply(age_category)
# check changes
df.head()

Unnamed: 0,name,platform,year_of_release,genre,na_sales,eu_sales,jp_sales,other_sales,tot_sales,critic_score,user_score,rating,age_category
0,wii sports,wii,2006.0,sports,41.36,28.96,3.77,8.45,82.54,76.0,8.0,e,6-10
1,super mario bros.,nes,1985.0,platform,29.08,3.58,6.81,0.77,40.24,,,,>25
2,mario kart wii,wii,2008.0,racing,15.68,12.76,3.79,3.29,35.52,82.0,8.3,e,6-10
3,wii sports resort,wii,2009.0,sports,15.61,10.93,3.28,2.95,32.77,80.0,8.0,e,6-10
4,pokemon red/pokemon blue,gb,1996.0,role-playing,11.27,8.89,10.22,1.0,31.38,,,,16-20


### Conclusion
Using the individual sales columns from the different regions, I was able to create a new column with the total sales. I also used the year_of_release column to create a new column that puts each game into an age category. 
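
The same age categories can also be built without a row-by-row function by using `pd.cut`. This is an alternative sketch, not the approach used in the notebook; the years below are made up, `REFERENCE_YEAR` is a name introduced here, and games released in 2016 (age 0) are assumed to belong in the '0-5' category.

```python
import numpy as np
import pandas as pd

REFERENCE_YEAR = 2016  # hypothetical constant for the year ages are measured from

# Hypothetical release years
years = pd.Series([2016.0, 2011.0, 2010.0, 2006.0, 1996.0, 1985.0])
age = REFERENCE_YEAR - years

# Bins are right-inclusive: (-inf, 5], (5, 10], ..., (25, inf)
labels = ['0-5', '6-10', '11-15', '16-20', '21-25', '>25']
age_cat = pd.cut(age, bins=[-np.inf, 5, 10, 15, 20, 25, np.inf], labels=labels)

print(age_cat.astype(str).tolist())
```

Because `pd.cut` returns an ordered categorical, the categories keep their natural order in plots and summaries.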

## Data Analysis & Plotting

In [24]:
# plot histogram to show frequency distribution of release years for titles
hist1 = px.histogram(df,x='year_of_release',title='Number of Game Titles Released Each Year',
                     labels={'year_of_release':'Game Release Year'})
hist1.update_layout(title_font_size=20,bargap=0.1,
                    height=500)
hist1.show()


In [25]:
hist2 = px.histogram(df,x='age_category',title='Number of Games Released According to Age Category',
                     labels={'age_category':'Age Category (in years)'},
                     color='age_category',
                     category_orders=dict(age_category=['0-5','6-10','11-15','16-20','21-25','>25']))
hist2.update_layout(title_font_size=20,showlegend=False)
hist2.show()

#### Conclusion
The distribution of the number of games released each year is left-skewed. This makes sense, as video game consoles did not become popular until the 1980s, and more games were produced as their popularity grew. Most of the games in the data were released after 2000, with fewer titles from the most recent years (2012 onward). The histogram by age category shows the same information, but makes it easier to see that the largest share of the data comes from games released 6-10 years before 2016.

In [28]:
# distribution of sales across each platform
platform_sales = df.groupby(by=['platform'])['tot_sales'].sum().sort_values(ascending=False)
platform_sales
bar1 = px.bar(platform_sales,title='Total Sales by Platform',
              labels={'platform':'Platform','value':'Global Sales (USD Million)'},
              color=platform_sales.index)
bar1.update_layout(title_font_size=20,height=500,
                   showlegend=False)
bar1.update_xaxes(tickangle=45)
bar1.show()

#### Conclusion
The platforms with the highest total sales are the PlayStation 2, Xbox 360, PlayStation 3, Wii, Nintendo DS, and PlayStation. There is a clear top 6, after which total sales for the next highest-selling platform drop by almost half. I will therefore work with the top 6 platforms to compare each platform's distribution of total sales by release year.
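
The list of top platforms does not have to be typed by hand; it can be read off the grouped totals. A minimal sketch on made-up sales figures (`toy` and `top_n` are hypothetical names):

```python
import pandas as pd

# Hypothetical per-game sales
toy = pd.DataFrame({
    'platform': ['ps2', 'ps2', 'x360', 'wii', 'gba', 'ds'],
    'tot_sales': [3.0, 2.0, 4.0, 1.5, 0.5, 1.0],
})

# Total sales per platform
platform_totals = toy.groupby('platform')['tot_sales'].sum()

# Names of the top_n platforms by total sales
top_n = 3
top_platforms = platform_totals.nlargest(top_n).index.tolist()

# Keep only the games released on those platforms
top_df = toy[toy['platform'].isin(top_platforms)]
print(top_platforms)
```

Deriving the list this way keeps it in sync with the data if the preprocessing steps change.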

In [29]:
# create a new df containing only the games from the top platforms with the most sales
top_platform_list = ['ps2','x360','ps3','wii','ds','ps']
top_platform_df = df[df.platform.isin(top_platform_list)]


In [30]:
# distribution of platform sales each year for the top performing platforms
hist3 = px.histogram(top_platform_df,x='year_of_release', 
                     title='Distribution of Total Yearly Sales by Platform (Overlay)',
                     labels={'year_of_release':'Release Year','tot_sales':'Global Sales (USD Million)'},
                     color='platform',y='tot_sales',
                     barmode='overlay',opacity=0.8)
hist3.update_layout(title_font_size=20,legend_title_text='Platform',
                    height=500)
hist3.show()

In [33]:
# distribution of all platform sales each year in stacked histogram
hist4 = px.histogram(df,x='year_of_release',
                     title='Distribution of Total Yearly Sales Across All Platforms (Stacked)',
                     color='platform',y='tot_sales',
                     labels={'year_of_release':'Release Year','tot_sales':'Global Sales (USD Million)'})
hist4.update_layout(title_font_size=20,legend_title_text='Platform',
                    height=500)
hist4.show()

#### Conclusion

<p>There are many platforms, such as the Playstation & Game Boy Advance, that used to be very popular but now have zero sales. You can even see that there are platforms, such as the Playstation 3 & Xbox 360, whose total sales amounts are now falling quickly. This clearly shows that most of the video game platforms have a sort of life cycle, where they are introduced, then popularized, then become outdated and are replaced by a newer & better game platform that is introduced to the market.</p>

<p>It is also interesting to see that platforms are often introduced in the same year as another platform. This clearly demonstrates the competing nature of the different platforms. Generally speaking, platforms appear to have a "lifespan" of about 10 years, from when they are introduced to the market until they fade in popularity & use. It also appears that a new platform is introduced every 7 years or so, at which point the total sales of the previous platform begin to decline as the total sales of the newer platform begin to rise.</p>

<p>Taking this into consideration, I will focus on the time period from 2013 to 2016. I feel that starting at 2013 is a good choice because this is the point at which the older platforms (such as the ps3, x360, and wii) are starting to fade, while the new platforms (ps4, xone) are starting to emerge in popularity & sales.</p>
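
The lifespan estimate above was read from the histograms, but it can also be approximated numerically from the first and last release year recorded for each platform. The sketch below runs on made-up years (`toy` is hypothetical); on the real data, platforms still on sale in 2016 would show a shortened lifespan.

```python
import pandas as pd

# Hypothetical release years per platform
toy = pd.DataFrame({
    'platform': ['ps2', 'ps2', 'ps2', 'ps3', 'ps3', 'ps4'],
    'year_of_release': [2000.0, 2005.0, 2011.0, 2006.0, 2016.0, 2013.0],
})

# First and last year with a release on each platform
lifespan = toy.groupby('platform')['year_of_release'].agg(first_year='min', last_year='max')

# Number of calendar years covered, counting both endpoints
lifespan['years_active'] = lifespan['last_year'] - lifespan['first_year'] + 1
print(lifespan)
```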

In [34]:
# keep only games released in 2013 or later, disregard the data for previous years
filtered_df = df[(df.year_of_release>=2013)]

In [35]:
# distribution of total sales each year for each platform
hist5 = px.histogram(filtered_df,x='year_of_release', color='platform',y='tot_sales',
                     labels={'year_of_release':'Release Year','tot_sales':'Global Sales (in USD Million)'},
                     title='Total Sales by Year by Platform',
                     nbins=4)
hist5.update_layout(title_font_size=20,legend_title_text='Platform', height=500)
# tickvals sets where the ticks go; ticktext on its own has no effect
hist5.update_xaxes(tickmode='array',tickvals=[2013,2014,2015,2016])
hist5.show()

#### Conclusion
The most popular platforms are the PlayStation 3, PlayStation 4, Xbox 360, and Xbox One. The histogram clearly shows that sales for the PlayStation 3 & Xbox 360 are declining, while sales for the PlayStation 4 & Xbox One are on the rise. This leads me to believe that the PlayStation 4 & Xbox One will be potentially profitable platforms for 2017.
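
The rising and falling trends can be checked in a table as well as in the chart. This sketch pivots made-up sales (`toy` is hypothetical) into one row per year and one column per platform, then takes the year-over-year difference.

```python
import pandas as pd

# Hypothetical sales by platform and release year
toy = pd.DataFrame({
    'platform': ['ps3', 'ps3', 'ps4', 'ps4', 'ps4'],
    'year_of_release': [2013, 2014, 2013, 2014, 2014],
    'tot_sales': [10.0, 4.0, 2.0, 5.0, 3.0],
})

# Total sales per year (rows) and platform (columns)
yearly = toy.pivot_table(index='year_of_release', columns='platform',
                         values='tot_sales', aggfunc='sum')

# Change from the previous year; the first year has no predecessor and is NaN
change = yearly.diff()
print(yearly)
print(change)
```

A negative value marks a platform whose sales fell from one year to the next, and a positive value marks one that grew.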

In [36]:
# create boxplot of global sales for all platforms for filtered time period
box1 = px.box(filtered_df,x='tot_sales',y='platform',color='platform',
              title='Total Global Sales (USD Million) by Platform (2013-2016)',
              labels={'platform':'Platform','tot_sales':'Global Sales (USD Million)'})
box1.update_layout(title_font_size=20,height=800,boxgap=0.01,showlegend=False)
box1.update_traces(boxmean=True)
box1.show()

In [37]:
# zoomed in boxplot of global sales for all platforms for filtered time period
box1 = px.box(filtered_df,x='tot_sales',y='platform',color='platform',
              title='Total Global Sales (USD Million) by Platform (2013-2016)',
              labels={'platform':'Platform','tot_sales':'Global Sales (USD Million)'})
box1.update_layout(title_font_size=20,height=800,boxgap=0.01,showlegend=False)
box1.update_traces(boxmean=True)
box1.update_xaxes(range=(0,2.5))
box1.show()

In [38]:
# create df of top 5 platforms for filtered df
top_platform_filtered_list = filtered_df.groupby('platform')['tot_sales'].sum().sort_values(ascending=False).iloc[0:5]
top_platform_filtered_df = filtered_df[filtered_df.platform.isin(top_platform_filtered_list.index)]

In [39]:
# boxplot of total sales for the top 5 performing platforms in 2013-2016 time period
box2 = px.box(top_platform_filtered_df,x='tot_sales',y='platform',color='platform',
              title='Total Global Sales for the Top 5 Performing Platforms (2013-2016)',
              labels={'platform':'Platform','tot_sales':'Global Sales (USD Million)'})
box2.update_layout(title_font_size=20,height=500,showlegend=False,
                   boxgap=0.01)
box2.update_traces(boxmean=True)
box2.show()

In [40]:
# zoomed in boxplot of total sales for the top 5 performing platforms in 2013-2016 time period
box3 = px.box(top_platform_filtered_df,x='tot_sales',y='platform',color='platform',
              title='Total Global Sales for the Top 5 Performing Platforms (2013-2016)',
              labels={'platform':'Platform','tot_sales':'Global Sales (USD Million)'})
box3.update_layout(title_font_size=20,height=500,showlegend=False,
                   boxgap=0.01)
box3.update_traces(boxmean=True)
box3.update_xaxes(range=(0,3))
box3.show()

#### Conclusion
<p>The first boxplot shows that the sales distributions for most platforms in 2013-2016 are heavily skewed, with many large outliers. The distributions also show that over half of the games released in this period had less than $0.3M in sales, as every platform's median is below $0.3M. Beyond that, the distributions vary quite a bit from platform to platform, with the more popular platforms showing more spread. These top performing platforms also have much larger outliers, which contribute greatly to their total sales in this period.</p>

<p>The distribution means vary quite a bit from platform to platform. This makes sense, as large outliers can increase a platform's average considerably. Here we can see that all of the handheld platforms have smaller average sales than the other platforms (such as the ps3 or x360). The means of the top performing platforms all appear to be relatively similar, but I would have to perform statistical testing to confirm this.</p>

<p>Overall these distributions make sense as most platforms will have many games produced each year, but only a few games will be huge hits that make millions of dollars in sales. </p>
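
The gap between the mean and the median that the boxplots show can also be tabulated. A sketch on made-up sales (`toy` is hypothetical), where one hit title pulls the mean of a platform far above its median:

```python
import pandas as pd

# Hypothetical sales: mostly small titles plus one large hit on 'ps4'
toy = pd.DataFrame({
    'platform': ['ps4'] * 5 + ['psv'] * 5,
    'tot_sales': [0.1, 0.2, 0.3, 0.4, 14.0, 0.02, 0.04, 0.06, 0.08, 0.3],
})

# Median, mean and largest single title per platform
summary = toy.groupby('platform')['tot_sales'].agg(['median', 'mean', 'max'])

# A ratio well above 1 indicates a distribution stretched by large outliers
summary['mean_to_median'] = summary['mean'] / summary['median']
print(summary)
```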


In [41]:
# create new df that only contains games on the ps3 platform
ps3_df = df[df.platform=='ps3']

In [42]:
# scatterplot of user scores & total sales for games released for the ps3
scatter1 = px.scatter(ps3_df,x='tot_sales',y='user_score',color='genre',
                      title='User Score vs. Total Global Sales (Playstation 3 Games)',
                      labels={'user_score':'User Score','tot_sales':'Global Sales (USD Million)'})
scatter1.update_layout(title_font_size=20,legend_title_text='Genre',height=500)
scatter1.show()

In [43]:
# scatterplot of user scores & total sales for ps3 games across other platforms
other_platforms = df[(df.name.isin(ps3_df.name))]
scatter1 = px.scatter(other_platforms,x='tot_sales',y='user_score',color='platform',
                      title='User Score vs. Total Global Sales',
                      labels={'user_score':'User Score','tot_sales':'Global Sales (USD Million)'})
scatter1.update_layout(title_font_size=20,legend_title_text='Platform',height=500)
scatter1.show()

In [44]:
# calculate correlation between user_score and tot_sales (Playstation 3)
ps3_df.user_score.corr(ps3_df.tot_sales)

0.12841562938563006

In [45]:
# calculate correlation between user_score and tot_sales for ps3 titles across all platforms
other_platforms.user_score.corr(other_platforms.tot_sales)

0.09513632095684908

#### Conclusion
<p>Looking at the scatterplot, there does seem to be a weak positive correlation between user scores and total sales for these PlayStation 3 games. This is also reflected in the calculated Pearson correlation coefficient of 0.128. This makes sense, as good user ratings and reviews could help drive sales for highly rated games, while poor user ratings could deter future gamers from purchasing a game.</p>
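
Since many games have no user score, it helps to know that `Series.corr` only uses rows where both values are present. The sketch below, on made-up numbers (`toy` is hypothetical), reproduces the coefficient with NumPy on the complete rows and reports how many pairs it is based on.

```python
import numpy as np
import pandas as pd

# Hypothetical scores and sales with gaps in both columns
toy = pd.DataFrame({
    'user_score': [8.0, np.nan, 6.5, 7.0, 9.0],
    'tot_sales': [2.0, 5.0, 0.5, np.nan, 3.0],
})

# pandas skips incomplete pairs automatically (Pearson by default)
r_pandas = toy['user_score'].corr(toy['tot_sales'])

# Same calculation done explicitly on the complete rows
complete = toy.dropna(subset=['user_score', 'tot_sales'])
r_numpy = np.corrcoef(complete['user_score'], complete['tot_sales'])[0, 1]
n_pairs = len(complete)

print(r_pandas, r_numpy, n_pairs)
```

Reporting `n_pairs` next to the coefficient makes it clear how much of the data the correlation actually rests on.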

In [46]:
# scatterplot of critic scores & total sales for games released for the ps3
scatter2 = px.scatter(ps3_df,x='tot_sales',y='critic_score',color='genre',
                      title='Critic Score vs. Total Global Sales (Playstation 3 Games)',
                      labels={'tot_sales':'Total Global Sales (USD Million)',
                              'critic_score':'Critic Score'})
scatter2.update_layout(title_font_size=20,legend_title_text='Genre',height=500)
scatter2.show()

In [47]:
# scatterplot of critic scores & total sales for ps3 titles across all platforms
scatter2 = px.scatter(other_platforms,x='tot_sales',y='critic_score',color='platform',
                      title='Critic Score vs. Total Global Sales',
                      labels={'tot_sales':'Total Global Sales (USD Million)',
                              'critic_score':'Critic Score'})
scatter2.update_layout(title_font_size=20,legend_title_text='Platform',height=500)
scatter2.show()

In [48]:
# calculate correlation between critic score and tot sales for ps3
ps3_df.critic_score.corr(ps3_df.tot_sales)

0.4327589578997136

In [49]:
# calculate correlation between critic score and tot sales for ps3 titles across all platforms
other_platforms.critic_score.corr(other_platforms.tot_sales)

0.3761026881235178

#### Conclusion
<p> There appears to be a positive correlation between critic scores and total sales for PlayStation 3 games, and it looks stronger than the one between user scores and total sales. The calculated Pearson correlation coefficient of 0.43 reflects this. It is still not a strong correlation, but it is stronger than the one in the previous scatterplot. This makes sense: good critic ratings can help drive a game's sales, while poor critic ratings can deter potential customers. It also makes sense that this correlation is stronger than the one for user ratings. Critics have a larger platform to share their opinions and their reviews are often trusted more than an average user's, so their ratings probably have more influence on the decision to buy a game. </p>
<p> Overall, the correlations of critic and user scores with total sales are somewhat stronger for PlayStation 3 games than for the same games across the other platforms (for critic scores, 0.43 versus 0.38). All of these correlations are fairly weak, but user and critic scores do seem to be more closely associated with sales on the PS3 than on other platforms. </p>
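The platform comparison above can also be done in one pass instead of one cell per dataframe. The following is a minimal sketch on made-up toy data (the values are not from the dataset); it assumes the same column names as the notebook (`platform`, `critic_score`, `tot_sales`).

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'platform':     ['ps3', 'ps3', 'ps3', 'ps3', 'x360', 'x360', 'x360', 'x360'],
    'critic_score': [60, 70, 80, 90, 60, 70, 80, 90],
    'tot_sales':    [0.5, 1.0, 1.5, 2.0, 2.0, 1.5, 1.0, 0.5],
})

# Pearson correlation between critic score and total sales, per platform.
# Series.corr skips rows where either value is missing.
corr_by_platform = {
    platform: group['critic_score'].corr(group['tot_sales'])
    for platform, group in toy.groupby('platform')
}
print(corr_by_platform)
```

In this toy data the relationship is perfectly linear, so the coefficients are close to 1 for `ps3` and close to -1 for `x360`. Real data will give values in between, as in the cells above.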

In [50]:
# compare the sales of the same games on other platforms
# select the 20 best-selling games on ps3 & look into their sales on other platforms besides the ps3
ps3_top20 = ps3_df.sort_values(by='tot_sales',ascending=False).reset_index(drop=True).iloc[0:20][['name','tot_sales']]
ps3_top20

Unnamed: 0,name,tot_sales
0,grand theft auto v,21.05
1,call of duty: black ops ii,13.79
2,call of duty: modern warfare 3,13.33
3,call of duty: black ops,12.63
4,gran turismo 5,10.7
5,call of duty: modern warfare 2,10.61
6,grand theft auto iv,10.5
7,call of duty: ghosts,9.36
8,fifa soccer 13,8.17
9,battlefield 3,7.17


In [51]:
# create new df containing only the rows whose titles are in the list of top 20 ps3 games
games_df = df[df.name.isin(ps3_top20.name)]
print(games_df.shape)
# create pivot table to see tot_sales for each game on each platform
# (pass the string 'sum' rather than the built-in sum to avoid the pandas FutureWarning)
games_df.pivot_table(index='name',columns='platform',values='tot_sales',aggfunc='sum')

(77, 13)

name,3ds,ds,pc,ps2,ps3,ps4,psp,psv,wii,wiiu,x360,xone
assassin's creed iii,,,0.93,,6.44,,,,,0.35,5.29,
battlefield 3,,,2.78,,7.17,,,,,,7.32,
call of duty 4: modern warfare,,1.05,1.15,,6.68,,,,,,9.32,
call of duty: black ops,,0.58,,,12.63,,,,1.37,,14.62,
call of duty: black ops ii,,,1.52,,13.79,,,,,0.41,13.68,
call of duty: ghosts,,,0.69,,9.36,3.83,,,,0.35,10.24,2.92
call of duty: modern warfare 2,,,0.89,,10.61,,,,,,13.47,
call of duty: modern warfare 3,,,1.71,,13.33,,,,0.83,,14.73,
fifa 12,0.39,,0.47,0.08,6.64,,0.52,,0.76,,4.17,
fifa 14,0.23,,0.4,,6.46,3.01,0.19,0.41,0.38,,4.22,1.16


In [52]:
# create histogram based on sales across platforms for these games
hist6 = px.histogram(games_df,x='name',y='tot_sales',color='platform',
                     labels={'tot_sales':'Global Sales (USD Million)','name':'Game Title'},
                     title='Total Sales for Top Selling Games by Platform')
hist6.update_layout(title_font_size=20,height=600,legend_title_text='Platform')
hist6.update_xaxes(tickfont_size=11)
hist6.show()

#### Conclusion
<p>Looking at the distribution of sales for each of these popular games, the PS3 and the Xbox 360 account for most of the sales. The game with the greatest total global sales was offered across many platforms, including both older and newer ones (such as the PS3 and the PS4). It is also interesting that some games were released on a single platform only, such as Gran Turismo 5 and Uncharted 2. This leads me to believe that some game studios have exclusivity arrangements with certain platforms.</p>
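Single-platform titles can be picked out of the pivot table programmatically rather than by eye. This is a rough sketch on invented toy data, using the same `name`, `platform` and `tot_sales` columns as above. A title that appears on only one platform in the data is a candidate exclusive, but that alone does not prove an exclusivity deal.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'name':      ['game a', 'game a', 'game b', 'game c', 'game c', 'game c'],
    'platform':  ['ps3', 'x360', 'ps3', 'ps3', 'x360', 'pc'],
    'tot_sales': [5.0, 6.0, 10.0, 3.0, 4.0, 1.0],
})

pivot = toy.pivot_table(index='name', columns='platform',
                        values='tot_sales', aggfunc='sum')

# Number of platforms with a sales figure for each title
platform_counts = pivot.notna().sum(axis=1)

# Titles that appear on exactly one platform
single_platform = platform_counts[platform_counts == 1].index.tolist()
print(single_platform)
```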

In [53]:
# distribution of games in filtered period by genre
hist7 = px.histogram(filtered_df,x='genre', color='platform',
                     title='Distribution of Games by Genre (2013-2016)',
                     labels={'genre':'Genre'})
hist7.update_layout(legend_title_text='Platform',title_font_size=20,
                    height=500)
hist7.update_xaxes(tickangle=25)
hist7.show()

#### Conclusion
Looking at the distribution of games by genre for 2013-2016, action games are clearly the most commonly released. Role-playing and adventure games are also popular, but each of those genres has fewer than half as many games as the action genre. 

In [54]:
# total sales distribution by genre
box4 = px.box(filtered_df,x='tot_sales',y='genre',color='genre',
              title='Distribution of Total Sales by Genre',
              labels={'genre':'Genre','tot_sales':'Global Sales (USD Million)'})
box4.update_layout(height=600,showlegend=False,boxgap=0.01,title_font_size=20)
box4.update_traces(boxmean=True)
box4.update_yaxes(tickangle=-25)
box4.show()

#### Conclusion

<p> The genres with the highest sales per game are shooter and sports. Both distributions have high means and large outliers, and their medians are higher than those of most other genres, so a typical shooter or sports game outsells a typical game in most other genres. It is also interesting that the action genre, the most popular in terms of games released, has a lower median and mean but many large outliers. Certain action games that become hits can reach large total sales, but because so many more action games are produced, it is probably harder to become such a distinguished hit. </p>

<p>The genre with the lowest average sales is puzzle, which is also the genre with the fewest games released. This makes sense: a puzzle game has little chance of becoming a hit, so studios are probably more likely to produce games in genres with better sales prospects.</p>
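The numbers read off the box plot (games per genre, mean and median sales) can also be tabulated directly. The sketch below uses invented toy values and assumes the notebook's `genre` and `tot_sales` columns. It is only meant to show the shape of the computation.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'genre':     ['shooter', 'shooter', 'shooter',
                  'action', 'action', 'action', 'action', 'puzzle'],
    'tot_sales': [1.0, 2.0, 6.0, 0.1, 0.2, 0.3, 5.0, 0.1],
})

# Count, mean and median of total sales per genre, ordered by median
genre_summary = (
    toy.groupby('genre')['tot_sales']
       .agg(['count', 'mean', 'median'])
       .sort_values('median', ascending=False)
)
print(genre_summary)
```

Sorting by the median rather than the mean keeps a handful of outlier hits from dominating the ranking.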

## User Profiles for Each Region (North America, Europe & Japan)

### NA Region Profile

In [55]:
# top 5 platforms of NA region
NA_platforms = filtered_df.groupby('platform')['na_sales'].sum().sort_values(ascending=False).iloc[0:5]
NA_platforms

platform
ps4     108.74
xone     93.12
x360     81.66
ps3      63.50
3ds      38.20
Name: na_sales, dtype: float64

In [56]:
# top 5 genres of NA region
NA_genres = filtered_df.groupby('genre')['na_sales'].sum().sort_values(ascending=False).iloc[0:5]
NA_genres

genre
action          126.05
shooter         109.74
sports           65.27
role-playing     46.40
misc             27.49
Name: na_sales, dtype: float64

In [57]:
# box plot of north american sales across ratings
# create df with no missing values in rating column for filtered df
ratings_df = filtered_df[~filtered_df.rating.isna()]
box5 = px.box(ratings_df,x='na_sales',y='rating',color='rating',
              title='Distribution of North American sales by ESRB Rating (2013-2016)',
              labels={'rating':'ESRB Rating','na_sales':'North American Sales (USD Million)'})
box5.update_layout(showlegend=False,title_font_size=20,boxgap=0.1)
box5.update_traces(boxmean=True)
box5.show()

#### Conclusion
The most popular platforms in the North American region are the PlayStation 4, Xbox One, Xbox 360, PlayStation 3 and Nintendo 3DS. The top genres are action, shooter, sports, role-playing and miscellaneous. Looking at the distributions of North American sales by ESRB rating, games with a Mature rating have the highest average sales. This leads me to believe that a restrictive rating does not prevent a game from becoming a hit seller. While a Mature rating restricts sales to customers of a certain age, it does not appear to limit the game's overall sales in North America. 
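Because the rating comparison rests on averages, it can help to compute the mean and median side by side: a few very large sellers can pull a rating's mean up while its median stays low. The following is an illustrative sketch on invented toy values, assuming the notebook's `rating` and `na_sales` columns.

```python
import pandas as pd

# Toy data, invented for illustration only; the last row has no rating
toy = pd.DataFrame({
    'rating':   ['M', 'M', 'M', 'E', 'E', 'T', None],
    'na_sales': [0.1, 0.2, 3.0, 0.3, 0.5, 0.4, 9.0],
})

# groupby drops rows whose group key is missing, so unrated games are excluded
rating_stats = toy.groupby('rating')['na_sales'].agg(['mean', 'median'])
print(rating_stats)
```

In this toy example the `M` group has the highest mean but the lowest median, because a single large value drives the average.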

### EU Region Profile

In [58]:
# top 5 platforms of EU region
EU_platforms = filtered_df.groupby('platform')['eu_sales'].sum().sort_values(ascending=False).iloc[0:5]
EU_platforms

platform
ps4     141.09
ps3      67.81
xone     51.59
x360     42.52
3ds      30.96
Name: eu_sales, dtype: float64

In [59]:
# top 5 genres of EU region
EU_genres = filtered_df.groupby('genre')['eu_sales'].sum().sort_values(ascending=False).iloc[0:5]
EU_genres

genre
action          118.13
shooter          87.86
sports           60.52
role-playing     36.97
racing           20.19
Name: eu_sales, dtype: float64

In [60]:
# box plot of European sales across ratings
box6 = px.box(ratings_df,x='eu_sales',y='rating',color='rating',
              title='Distribution of European sales by ESRB Rating (2013-2016)',
              labels={'rating':'ESRB Rating','eu_sales':'European Sales (USD Million)'})
box6.update_layout(showlegend=False,title_font_size=20,boxgap=0.1)
box6.update_traces(boxmean=True)
box6.show()

#### Conclusion
The most popular platforms in the European region are the PlayStation 4, PlayStation 3, Xbox One, Xbox 360 and Nintendo 3DS. The top genres are action, shooter, sports, role-playing and racing. The distributions of European sales for each ESRB rating are quite similar to those of North American sales. Once again, the rating with the highest average sales is Mature, which again leads me to believe that the rating doesn't necessarily impede a game's ability to become a hit. 

### JP Region Profile

In [61]:
# top 5 platforms of JP region
JP_platforms = filtered_df.groupby('platform')['jp_sales'].sum().sort_values(ascending=False).iloc[0:5]
JP_platforms

platform
3ds     67.81
ps3     23.35
psv     18.59
ps4     15.96
wiiu    10.88
Name: jp_sales, dtype: float64

In [62]:
# top 5 genres of JP region
JP_genres = filtered_df.groupby('genre')['jp_sales'].sum().sort_values(ascending=False).iloc[0:5]
JP_genres

genre
role-playing    51.04
action          40.49
misc             9.20
fighting         7.65
shooter          6.61
Name: jp_sales, dtype: float64

In [63]:
# box plot of Japanese sales across ratings
box7 = px.box(ratings_df,x='jp_sales',y='rating',color='rating',
              title='Distribution of Japanese sales by ESRB Rating (2013-2016)',
              labels={'rating':'ESRB Rating','jp_sales':'Japanese Sales (USD Million)'})
box7.update_layout(showlegend=False,title_font_size=20,boxgap=0.1)
box7.update_traces(boxmean=True)
box7.show()

#### Conclusion
The most popular platforms in Japan are the Nintendo 3DS, PlayStation 3, PlayStation Vita, PlayStation 4 and Nintendo Wii U. The top genres in Japan are role-playing, action, miscellaneous, fighting and shooter games. Unlike in the North American and European regions, ESRB rating does appear to matter for sales in Japan. The rating with the highest average sales in Japan is Teen, and the Mature rating's average is much lower than in the other two regions. 

### Conclusion
Overall the North American and European regions are fairly similar in terms of popular platforms, popular genres and how ratings affect the sales of games. The Japanese region differs greatly from the other two regions in all of these aspects. 
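The three regional top-5 cells repeat the same groupby with a different sales column. A small helper can do this in a loop and also express each platform's sales as a share of the regional total, which makes regions of different size easier to compare. This is a minimal sketch: `top_share` is a hypothetical helper name, and the data is invented for illustration.

```python
import pandas as pd

def top_share(df, group_col, sales_col, n=5):
    """Top-n groups by summed sales, as a share of the regional total."""
    totals = df.groupby(group_col)[sales_col].sum()
    return (totals / totals.sum()).sort_values(ascending=False).head(n)

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'platform': ['ps4', 'ps4', '3ds', '3ds', 'xone'],
    'na_sales': [3.0, 2.0, 1.0, 1.0, 3.0],
    'jp_sales': [0.5, 0.5, 2.0, 2.0, 0.0],
})

for region in ['na_sales', 'jp_sales']:
    print(region)
    print(top_share(toy, 'platform', region, n=2))
```

The same helper works for genres by passing `'genre'` as `group_col`.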

## Statistical Hypotheses Testing

<p>Test Hypothesis: Average user ratings of the Xbox One and PC platforms are the same</p>

- Null Hypothesis: The average user ratings of the Xbox One and PC platforms are equal
- Alternative Hypothesis: The average user ratings of the Xbox One and PC platforms are not equal
- Alpha (Significance) Value: 5 percent

In [64]:
# create series of user ratings for the Xbox One and PC platforms
xone_ratings = filtered_df[filtered_df.platform=='xone']['user_score']
xone_ratings = xone_ratings[~xone_ratings.isna()]
pc_ratings = filtered_df[filtered_df.platform=='pc']['user_score']
pc_ratings = pc_ratings[~pc_ratings.isna()]

In [65]:
# test the hypotheses
alpha = 0.05
results1 = st.ttest_ind(xone_ratings,pc_ratings)
print(f'p-value: {results1.pvalue}')

p-value: 0.14012658403611647


#### Conclusion
The p-value (about 0.14) is greater than the chosen alpha value of 0.05, so we cannot reject the null hypothesis. The data do not provide sufficient evidence that the average user ratings of the Xbox One and PC platforms differ. This is not proof that the two averages are equal.
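The decision rule can be written out explicitly by comparing the p-value with `alpha`. The sketch below also passes `equal_var=False`, which makes `ttest_ind` run Welch's t-test; this is an alternative to the equal-variance test used above, not what the notebook ran, and it is often preferred when the two groups may have different variances. The sample values are invented for illustration.

```python
import numpy as np
from scipy import stats as st

# Toy samples, invented for illustration only:
# similar means, clearly different spread
sample_a = np.array([6.0, 6.5, 7.0, 6.5, 6.0, 7.0])
sample_b = np.array([4.0, 9.0, 6.5, 5.0, 8.0, 7.0])

alpha = 0.05
# equal_var=False requests Welch's t-test (no equal-variance assumption)
result = st.ttest_ind(sample_a, sample_b, equal_var=False)

if result.pvalue < alpha:
    decision = 'reject the null hypothesis'
else:
    decision = 'fail to reject the null hypothesis'
print(f'p-value: {result.pvalue:.4f}, decision: {decision}')
```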

<p>Test Hypothesis: Average user ratings for the Action and Sports genres are different.</p>

- Null Hypothesis: The average user ratings of the Action and Sports genres are equal
- Alternative Hypothesis: The average user ratings of the Action and Sports genres are not equal
- Alpha (Significance) Value: 5 percent

In [66]:
# create series of user ratings for the Action and Sports genres
action_ratings = filtered_df[filtered_df.genre=='action']['user_score']
action_ratings = action_ratings[~action_ratings.isna()]
sports_ratings = filtered_df[filtered_df.genre=='sports']['user_score']
sports_ratings = sports_ratings[~sports_ratings.isna()]

In [67]:
# test the hypotheses
alpha = 0.05
results2 = st.ttest_ind(action_ratings,sports_ratings)
print(f'p-value: {results2.pvalue}')

p-value: 1.0517832389140023e-27


#### Conclusion
The p-value is far smaller than the chosen alpha value. We reject the null hypothesis and conclude that the average user ratings of the Action and Sports genres are not equal. 

## Conclusion
<p>After analyzing video game sales over the years, I found that the lifespan of most platforms was about 10 years. Using that information, I focused on the sales data from 2013 to 2016 to help determine which games have the greatest potential to become hits. Since many platforms are declining in popularity, the hit games of 2017 will most likely be on the platforms that are currently rising in popularity, such as the PlayStation 4 and the Xbox One. The genres with the highest sales per game in this period are shooter and sports. Other popular genres, such as action and role-playing, can still be very profitable. I also analyzed the behavior of customers in the different regions (North America, Europe and Japan). The profiles for North America and Europe are fairly similar, and ESRB ratings do not have much of an effect on sales in these regions. The Japan profile is quite different, with different popular platforms and genres, and ESRB ratings appear to have more of an impact on sales there. </p>
<p> Based on these conclusions, it would be best to back games released on the newer platforms, such as the PlayStation 4 and the Xbox One, in the shooter, sports, action and role-playing genres. It would also be best to back games across all of the ESRB ratings, as Mature-rated games tend to sell better in North America and Europe, while Teen- and Everyone-rated games tend to sell better in Japan. </p>