# Video Game Analysis

This project uses a dataset covering platform, year of release, genre, sales, and ratings for 16715 video games released between 1980 and 2016. The dataset was accessed through Practicum's Data Science Bootcamp site and loaded into Jupyter Notebook for analysis using Python. The purpose of this project is to conduct data cleaning, exploratory data analysis, data visualization, and independent samples t-tests to help predict video game sales and ratings in 2017. The project is organized into 5 parts:

1. Import and clean data for missing values and duplicates.
2. Exploratory data analysis and data visualizations of video games by year, platform, game, critic score, and genre.
3. Summary statistics by region to create regional customer profiles.
4. Hypothesis tests comparing average user ratings by platform and by genre.
5. Project conclusion and business application.

## 1. Import and Clean Data

In this section, the data are imported, checked for missing values and duplicates, and a column for total sales is added. 

### 1a. Import Libraries and Data

In [1]:
# Import Libraries
import numpy as np
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
import scipy.stats as st

In [2]:
# Read in dataset
vg = pd.read_csv("Users/kellyshreeve/Desktop/Data-Sets/moved_games.csv")

In [3]:
# Print dataset info
vg.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16715 entries, 0 to 16714
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Name             16713 non-null  object 
 1   Platform         16715 non-null  object 
 2   Year_of_Release  16446 non-null  float64
 3   Genre            16713 non-null  object 
 4   NA_sales         16715 non-null  float64
 5   EU_sales         16715 non-null  float64
 6   JP_sales         16715 non-null  float64
 7   Other_sales      16715 non-null  float64
 8   Critic_Score     8137 non-null   float64
 9   User_Score       10014 non-null  object 
 10  Rating           9949 non-null   object 
dtypes: float64(6), object(5)
memory usage: 1.4+ MB


In [4]:
# View first 15 rows of the dataset 
vg.head(15)

Unnamed: 0,Name,Platform,Year_of_Release,Genre,NA_sales,EU_sales,JP_sales,Other_sales,Critic_Score,User_Score,Rating
0,Wii Sports,Wii,2006.0,Sports,41.36,28.96,3.77,8.45,76.0,8.0,E
1,Super Mario Bros.,NES,1985.0,Platform,29.08,3.58,6.81,0.77,,,
2,Mario Kart Wii,Wii,2008.0,Racing,15.68,12.76,3.79,3.29,82.0,8.3,E
3,Wii Sports Resort,Wii,2009.0,Sports,15.61,10.93,3.28,2.95,80.0,8.0,E
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,11.27,8.89,10.22,1.0,,,
5,Tetris,GB,1989.0,Puzzle,23.2,2.26,4.22,0.58,,,
6,New Super Mario Bros.,DS,2006.0,Platform,11.28,9.14,6.5,2.88,89.0,8.5,E
7,Wii Play,Wii,2006.0,Misc,13.96,9.18,2.93,2.84,58.0,6.6,E
8,New Super Mario Bros. Wii,Wii,2009.0,Platform,14.44,6.94,4.7,2.24,87.0,8.4,E
9,Duck Hunt,NES,1984.0,Shooter,26.93,0.63,0.28,0.47,,,


This dataset has a total of 16715 entries. There are missing values in Name, Year_of_Release, Genre, Critic_Score, User_Score, and Rating. Year_of_Release is stored as float and needs to be changed to int, and User_Score is stored as object and needs to be changed to float.

### 1b. Rename Columns and Fix Data Types

In [5]:
# Change variable names to snake case
vg.columns = vg.columns.str.lower()

print(vg.columns)

Index(['name', 'platform', 'year_of_release', 'genre', 'na_sales', 'eu_sales',
       'jp_sales', 'other_sales', 'critic_score', 'user_score', 'rating'],
      dtype='object')


#### 1bi. Change Year of Release to int

In [6]:
# First, check unique values of 'year_of_release'
print(vg['year_of_release'].unique())

[2006. 1985. 2008. 2009. 1996. 1989. 1984. 2005. 1999. 2007. 2010. 2013.
 2004. 1990. 1988. 2002. 2001. 2011. 1998. 2015. 2012. 2014. 1992. 1997.
 1993. 1994. 1982. 2016. 2003. 1986. 2000.   nan 1995. 1991. 1981. 1987.
 1980. 1983.]


The values are all whole numbers, so it is safe to convert 'year_of_release' to int.
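The whole-number check can also be done programmatically rather than by reading the list of unique values. The following is a minimal sketch on a toy Series; the name `years` and its values are illustrative stand-ins for `vg['year_of_release']`, not the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for vg['year_of_release'] (illustrative values only)
years = pd.Series([2006.0, 1985.0, np.nan, 2008.0])

# True when every non-missing value has no fractional part
all_whole = bool((years.dropna() % 1 == 0).all())

# The nullable 'Int64' dtype keeps missing values as <NA> instead of raising
years_int = years.astype('Int64') if all_whole else years
```

The capital-I `Int64` dtype matters here: the plain NumPy `int64` dtype cannot hold missing values, so the conversion would fail while NaN is present.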

In [7]:
# Convert 'year_of_release' to int
vg['year_of_release'] = vg['year_of_release'].astype('Int64') # Change type

year_dtype = vg['year_of_release'].dtype # Save new type (avoids shadowing the built-in type)
print(f'The dtype for "year_of_release" now is: {year_dtype}') # Print new type

The dtype for "year_of_release" now is: Int64


#### 1bii. Change User Score to float

In [8]:
# Print unique values in 'user_score'
print(vg['user_score'].unique())

['8' nan '8.3' '8.5' '6.6' '8.4' '8.6' '7.7' '6.3' '7.4' '8.2' '9' '7.9'
 '8.1' '8.7' '7.1' '3.4' '5.3' '4.8' '3.2' '8.9' '6.4' '7.8' '7.5' '2.6'
 '7.2' '9.2' '7' '7.3' '4.3' '7.6' '5.7' '5' '9.1' '6.5' 'tbd' '8.8' '6.9'
 '9.4' '6.8' '6.1' '6.7' '5.4' '4' '4.9' '4.5' '9.3' '6.2' '4.2' '6' '3.7'
 '4.1' '5.8' '5.6' '5.5' '4.4' '4.6' '5.9' '3.9' '3.1' '2.9' '5.2' '3.3'
 '4.7' '5.1' '3.5' '2.5' '1.9' '3' '2.7' '2.2' '2' '9.5' '2.1' '3.6' '2.8'
 '1.8' '3.8' '0' '1.6' '9.6' '2.4' '1.7' '1.1' '0.3' '1.5' '0.7' '1.2'
 '2.3' '0.5' '1.3' '0.2' '0.6' '1.4' '0.9' '1' '9.7']


user_score values include numeric strings, nan, and tbd. The tbd values need further exploration to determine whether they are associated with a specific year, country, or rating.

In [9]:
# Check the dataset for patterns where user_score == tbd
display(vg[vg['user_score']=='tbd'].head(30))

Unnamed: 0,name,platform,year_of_release,genre,na_sales,eu_sales,jp_sales,other_sales,critic_score,user_score,rating
119,Zumba Fitness,Wii,2010.0,Sports,3.45,2.59,0.0,0.66,,tbd,E
301,Namco Museum: 50th Anniversary,PS2,2005.0,Misc,2.08,1.35,0.0,0.54,61.0,tbd,E10+
520,Zumba Fitness 2,Wii,2011.0,Sports,1.51,1.03,0.0,0.27,,tbd,T
645,uDraw Studio,Wii,2010.0,Misc,1.65,0.57,0.0,0.2,71.0,tbd,E
657,Frogger's Adventures: Temple of the Frog,GBA,,Adventure,2.15,0.18,0.0,0.07,73.0,tbd,E
718,Just Dance Kids,Wii,2010.0,Misc,1.52,0.54,0.0,0.18,,tbd,E
726,Dance Dance Revolution X2,PS2,2009.0,Simulation,1.09,0.85,0.0,0.28,,tbd,E10+
821,The Incredibles,GBA,2004.0,Action,1.15,0.77,0.04,0.1,55.0,tbd,E
881,Who wants to be a millionaire,PC,1999.0,Misc,1.94,0.0,0.0,0.0,,tbd,E
1047,Tetris Worlds,GBA,2001.0,Puzzle,1.25,0.39,0.0,0.06,65.0,tbd,E


In this sample, video games with a user_score of tbd all have jp_sales at or near zero. I will check whether any games with jp_sales near zero have user scores other than tbd, to confirm whether low Japanese sales explain the tbd value.

In [10]:
# Check if all jp_sales close to zero have a user_score of tbd
display(vg[(vg['jp_sales'] >= 0) & (vg['jp_sales'] <= .05)].head(30)) # Print rows with 0 <= jp_sales <= 0.05

Unnamed: 0,name,platform,year_of_release,genre,na_sales,eu_sales,jp_sales,other_sales,critic_score,user_score,rating
60,Call of Duty: Ghosts,X360,2013.0,Shooter,6.73,2.56,0.04,0.91,73.0,2.6,M
61,Just Dance 3,Wii,2011.0,Misc,5.95,3.11,0.0,1.06,74.0,7.8,E10+
66,Halo 4,X360,2012.0,Shooter,6.65,2.28,0.04,0.74,87.0,7,M
68,Just Dance 2,Wii,2010.0,Misc,5.8,2.85,0.01,0.78,74.0,7.3,E10+
72,Minecraft,X360,2013.0,Misc,5.7,2.65,0.02,0.81,,,
78,Halo 2,XB,2004.0,Shooter,6.82,1.53,0.05,0.08,95.0,8.2,M
85,The Sims 3,PC,2009.0,Simulation,0.99,6.42,0.0,0.6,86.0,7.6,T
89,Pac-Man,2600,1982.0,Puzzle,7.28,0.45,0.0,0.08,,,
99,Call of Duty: Black Ops 3,XOne,2015.0,Shooter,4.59,2.11,0.01,0.68,,,
100,Call of Duty: World at War,X360,2008.0,Shooter,4.81,1.88,0.0,0.69,84.0,7.6,M


There are user_scores present for other instances of jp_sales close to zero. It does not appear that these sales figures are the reason for the tbd rating. There do not appear to be any other patterns in year, genre, sales, critic_score, or rating that would explain the user_score of tbd. Because there are no clear patterns explaining this value, I will treat tbd as a missing value. 

In [11]:
# Fill user_score tbd with nan
vg['user_score'] = vg['user_score'].replace('tbd', np.nan) # Replace tbd with nan

print('Unique values of user_score:') 
print()

print(vg['user_score'].unique()) # Print unique values

Unique values of user_score:

['8' nan '8.3' '8.5' '6.6' '8.4' '8.6' '7.7' '6.3' '7.4' '8.2' '9' '7.9'
 '8.1' '8.7' '7.1' '3.4' '5.3' '4.8' '3.2' '8.9' '6.4' '7.8' '7.5' '2.6'
 '7.2' '9.2' '7' '7.3' '4.3' '7.6' '5.7' '5' '9.1' '6.5' '8.8' '6.9' '9.4'
 '6.8' '6.1' '6.7' '5.4' '4' '4.9' '4.5' '9.3' '6.2' '4.2' '6' '3.7' '4.1'
 '5.8' '5.6' '5.5' '4.4' '4.6' '5.9' '3.9' '3.1' '2.9' '5.2' '3.3' '4.7'
 '5.1' '3.5' '2.5' '1.9' '3' '2.7' '2.2' '2' '9.5' '2.1' '3.6' '2.8' '1.8'
 '3.8' '0' '1.6' '9.6' '2.4' '1.7' '1.1' '0.3' '1.5' '0.7' '1.2' '2.3'
 '0.5' '1.3' '0.2' '0.6' '1.4' '0.9' '1' '9.7']


All tbd are converted to nan.

In [12]:
# Change user_score to float type
vg['user_score'] = pd.to_numeric(vg['user_score']) # Convert to float

vg.dtypes

name                object
platform            object
year_of_release      Int64
genre               object
na_sales           float64
eu_sales           float64
jp_sales           float64
other_sales        float64
critic_score       float64
user_score         float64
rating              object
dtype: object

year_of_release is now the nullable integer type Int64 and user_score is float64. All variables now have the correct data type.
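As a possible shortcut, the replace-then-convert steps can be collapsed into a single call. This is a minimal sketch on a toy Series; `raw_scores` and its values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the raw user_score column (illustrative values only)
raw_scores = pd.Series(['8', '8.3', 'tbd', None, '6.6'])

# errors='coerce' turns any unparseable string, including 'tbd', into NaN
scores = pd.to_numeric(raw_scores, errors='coerce')
n_missing = int(scores.isna().sum())
```

The trade-off is that `errors='coerce'` silently converts every unexpected string to NaN, so inspecting the unique values first, as done in this notebook, is still worthwhile.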

### 1c. Address Missing Values

In this section, the number of missing values per variable is displayed and the missing values for each variable are explored and filled using logical imputation, when appropriate.

In [13]:
# Display the number of missing values in each variable
print('The number of missing values in each variable:')

print(vg.isna().sum())

The number of missing values in each variable:
name                  2
platform              0
year_of_release     269
genre                 2
na_sales              0
eu_sales              0
jp_sales              0
other_sales           0
critic_score       8578
user_score         9125
rating             6766
dtype: int64


There are missing values for name, year_of_release, genre, critic_score, user_score, and rating. I will further explore the missing values for each variable to determine if and how missing values should be filled.

#### 1ci. Name and Genre

There are only two missing values each for name and genre, so the rows with missing data can be printed and addressed.

In [14]:
# Display missing values for name
print('The missing values for name:')

display(vg[vg['name'].isna()])

The missing values for name:


Unnamed: 0,name,platform,year_of_release,genre,na_sales,eu_sales,jp_sales,other_sales,critic_score,user_score,rating
659,,GEN,1993,,1.78,0.53,0.0,0.08,,,
14244,,GEN,1993,,0.0,0.0,0.03,0.0,,,


The two rows missing on name are also the two rows missing on genre. These rows are additionally missing critic_score, user_score, and rating but do have complete information for platform, year, and sales. I will fill name and genre with 'unknown' and address the critic_score, user_score, and rating later on.

In [15]:
# Fill missing values in name with unknown
vg['name'] = vg['name'].fillna('unknown')

# Fill missing values in genre with unknown
vg['genre'] = vg['genre'].fillna('unknown')

display(vg.iloc[[659, 14244], ])

Unnamed: 0,name,platform,year_of_release,genre,na_sales,eu_sales,jp_sales,other_sales,critic_score,user_score,rating
659,unknown,GEN,1993,unknown,1.78,0.53,0.0,0.08,,,
14244,unknown,GEN,1993,unknown,0.0,0.0,0.03,0.0,,,


Missing values for Name and Genre are filled with 'unknown'.

#### 1cii. Year of Release

In [16]:
# Display a sample of missing values for year_of_release
print('A sample of rows with missing values for year_of_release:')
display(vg[vg['year_of_release'].isna()].head(15))

A sample of rows with missing values for year_of_release:


Unnamed: 0,name,platform,year_of_release,genre,na_sales,eu_sales,jp_sales,other_sales,critic_score,user_score,rating
183,Madden NFL 2004,PS2,,Sports,4.26,0.26,0.01,0.71,94.0,8.5,E
377,FIFA Soccer 2004,PS2,,Sports,0.59,2.36,0.04,0.51,84.0,6.4,E
456,LEGO Batman: The Videogame,Wii,,Action,1.8,0.97,0.0,0.29,74.0,7.9,E10+
475,wwe Smackdown vs. Raw 2006,PS2,,Fighting,1.57,1.02,0.0,0.41,,,
609,Space Invaders,2600,,Shooter,2.36,0.14,0.0,0.03,,,
627,Rock Band,X360,,Misc,1.93,0.33,0.0,0.21,92.0,8.2,T
657,Frogger's Adventures: Temple of the Frog,GBA,,Adventure,2.15,0.18,0.0,0.07,73.0,,E
678,LEGO Indiana Jones: The Original Adventures,Wii,,Action,1.51,0.61,0.0,0.21,78.0,6.6,E10+
719,Call of Duty 3,Wii,,Shooter,1.17,0.84,0.0,0.23,69.0,6.7,T
805,Rock Band,Wii,,Misc,1.33,0.56,0.0,0.2,80.0,6.3,T


There are no apparent patterns in platform, genre, sales, score, or rating that explain the missing year_of_release values. A Google search of the names shows that these games were released in many different years. Because a major portion of this analysis is to determine patterns based on year of release, filling these 269 values with the mean or median could skew the results in favor of that year. This is also a relatively small number of missing values, accounting for only 1.6% of the data. Therefore, I will leave these values missing and exclude these games from analyses that use year_of_release.

#### 1ciii. Critic Score, User Score, and Rating

Missing values for critic score (8578), user score (9125), and rating (6766) account for roughly 51%, 55%, and 40% of the observations, respectively. With this much data missing, imputation will not be attempted and the values will be left as missing. Rows with missing values will be excluded from analyses that use these variables.
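The share of missing values per column can be computed directly instead of by hand. This is a minimal sketch on a toy frame; `df` and its values are illustrative stand-ins for `vg`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for vg (illustrative values only)
df = pd.DataFrame({
    'year_of_release': [2006, np.nan, 2008, 2009],
    'critic_score': [76.0, np.nan, np.nan, 80.0],
})

# isna().mean() gives the fraction missing; scale to percent and round
pct_missing = (df.isna().mean() * 100).round(1)
```

Applied to the real data, the same expression would report the percentage for every column at once.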

### 1d. Check for Duplicates

In [59]:
# Check for implicit duplicate names
names = sorted(vg['name'].unique()) # Unique names in alphabetical order

display(names[0:20]) # Show the first 20 names

[' beyblade burst',
 ' fire emblem fates',
 " frozen: olaf's quest",
 ' haikyu!! cross team match!',
 ' tales of xillia 2',
 "'98 koshien",
 '.hack//g.u. vol.1//rebirth',
 '.hack//g.u. vol.2//reminisce',
 '.hack//g.u. vol.2//reminisce (jp sales)',
 '.hack//g.u. vol.3//redemption',
 '.hack//infection part 1',
 '.hack//link',
 '.hack//mutation part 2',
 '.hack//outbreak part 3',
 '.hack//quarantine part 4: the final chapter',
 '.hack: sekai no mukou ni + versus',
 '007 racing',
 '007: quantum of solace',
 '007: the world is not enough',
 '007: tomorrow never dies']

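No implicit duplicates are evident in this sample of names. However, several names begin with a space (for example, ' fire emblem fates') and one carries a '(jp sales)' suffix, so variants of the same title could still exist under slightly different spellings.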
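One way to probe for implicit duplicates beyond a visual scan is to normalize the names and count how many distinct values disappear. This is a minimal sketch on a toy Series; `names` and its values are illustrative, and the normalization rules (strip and lowercase) are an assumption about what counts as the same title.

```python
import pandas as pd

# Toy stand-in for vg['name'] (illustrative values only)
names = pd.Series([' fire emblem fates', 'fire emblem fates', 'tetris', 'Tetris '])

# Normalize: trim surrounding whitespace and lowercase
normalized = names.str.strip().str.lower()

# Count names that become identical only after normalization
raw_unique = names.nunique()
norm_unique = normalized.nunique()
hidden_variants = raw_unique - norm_unique
```

A nonzero `hidden_variants` would suggest that some titles are stored under more than one spelling and deserve a closer look.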

In [18]:
# Check for fully duplicate rows
vg['name'] = vg['name'].str.lower()
vg['platform'] = vg['platform'].str.lower()

duplicates = vg.duplicated().sum()

print(f'The number of fully duplicate rows is: {duplicates}')

The number of fully duplicate rows is: 0


In [19]:
# Check for implicit name-platform-year duplicates
name_plat_duplicates = vg[['name', 'platform', 'year_of_release']].duplicated().sum()

print(f'The number of name-platform-year duplicates is: {name_plat_duplicates}')

The number of name-platform-year duplicates is: 2


In [20]:
# View the 2 duplicated rows
print('The two rows with duplicates are:')
display(vg[vg[['name', 'platform', 'year_of_release']].duplicated()])

The two rows with duplicates are:


Unnamed: 0,name,platform,year_of_release,genre,na_sales,eu_sales,jp_sales,other_sales,critic_score,user_score,rating
14244,unknown,gen,1993,unknown,0.0,0.0,0.03,0.0,,,
16230,madden nfl 13,ps3,2012,Sports,0.0,0.01,0.0,0.0,83.0,5.5,E


In [21]:
# Display the original row and duplicate for the first duplicated row
print('The first duplicated rows are:')
display(vg[(vg['name']=='unknown') & (vg['platform']=='gen') & (vg['year_of_release']==1993)])

The first duplicated rows are:


Unnamed: 0,name,platform,year_of_release,genre,na_sales,eu_sales,jp_sales,other_sales,critic_score,user_score,rating
659,unknown,gen,1993,unknown,1.78,0.53,0.0,0.08,,,
14244,unknown,gen,1993,unknown,0.0,0.0,0.03,0.0,,,


These two rows are identical other than na_sales, eu_sales, jp_sales, and other_sales. Because the sales figures are almost zero for the second row, I believe the second row is a mistake. I will delete the second row.

In [22]:
# Display the original row and the duplicate for the second duplicated row
print('The second duplicated rows are:')
display(vg[(vg['name']=='madden nfl 13') & (vg['platform']=='ps3') & (vg['year_of_release']==2012)])

The second duplicated rows are:


Unnamed: 0,name,platform,year_of_release,genre,na_sales,eu_sales,jp_sales,other_sales,critic_score,user_score,rating
604,madden nfl 13,ps3,2012,Sports,2.11,0.22,0.0,0.23,83.0,5.5,E
16230,madden nfl 13,ps3,2012,Sports,0.0,0.01,0.0,0.0,83.0,5.5,E


These two rows are identical other than na_sales, eu_sales, jp_sales, and other_sales. Because the sales figures are almost zero for the second row, I believe the second row is a mistake. I will delete the second row.


In [23]:
# Drop the implicit duplicate rows
vg = vg.drop_duplicates(subset=['name', 'platform', 'year_of_release']).reset_index(drop=True)

In [24]:
# Check the duplicate name-platform-year are removed
name_plat_duplicates_2 = vg[['name', 'platform', 'year_of_release']].duplicated().sum()

print(f'The number of name-platform-year duplicates now is: {name_plat_duplicates_2}')

The number of name-platform-year duplicates now is: 0
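For reference, the inspect-then-drop steps above can be sketched more compactly. The toy frame `df` below is illustrative (only its first and last rows mirror values shown in the output above), and `keep=False` is used so that every member of a duplicate group is displayed together.

```python
import pandas as pd

# Toy frame standing in for vg (illustrative values only)
df = pd.DataFrame({
    'name': ['madden nfl 13', 'tetris', 'madden nfl 13'],
    'platform': ['ps3', 'gb', 'ps3'],
    'year_of_release': [2012, 1989, 2012],
    'na_sales': [2.11, 23.2, 0.0],
})
key = ['name', 'platform', 'year_of_release']

# keep=False flags every member of a duplicate group, not just the repeats
dup_groups = df[df.duplicated(subset=key, keep=False)]

# The default keep='first' retains the earliest row in each group
deduped = df.drop_duplicates(subset=key).reset_index(drop=True)
```

Because `drop_duplicates` keeps the first row by default, the row with the larger sales figures survives here only because it appears earlier in the frame.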


### 1e. Add Additional Features

A column is added for the total sales across North America, Europe, Japan, and Other countries.

In [25]:
# Calculate total sum of sales across all regions
vg['total_sales'] = vg[['na_sales', 'eu_sales', 'jp_sales', 'other_sales']].sum(axis=1)

vg.head()

Unnamed: 0,name,platform,year_of_release,genre,na_sales,eu_sales,jp_sales,other_sales,critic_score,user_score,rating,total_sales
0,wii sports,wii,2006,Sports,41.36,28.96,3.77,8.45,76.0,8.0,E,82.54
1,super mario bros.,nes,1985,Platform,29.08,3.58,6.81,0.77,,,,40.24
2,mario kart wii,wii,2008,Racing,15.68,12.76,3.79,3.29,82.0,8.3,E,35.52
3,wii sports resort,wii,2009,Sports,15.61,10.93,3.28,2.95,80.0,8.0,E,32.77
4,pokemon red/pokemon blue,gb,1996,Role-Playing,11.27,8.89,10.22,1.0,,,,31.38


### 1f. Data Cleaning Conclusion

In this section, the data were read in and checked for correct data types, missing values, and duplicates. Year of release was changed to int data type and user score to float data type. Missing values for name and genre were filled with 'unknown'. Imputation was not attempted for year of release, because an imputed year could bias the year-based analyses, or for critic score, user score, and rating, because of the large share of missing values. Two name-platform-year duplicates were removed. An additional column was added for total sales across North America, Europe, Japan, and other countries. Data types and duplicates are no longer a concern, and the remaining missing values will be excluded from the analyses that use those variables. The data are now clean and ready for analysis.

## 2. Descriptive Statistics and Data Visualizations

This section uses the distribution of games released by year and of total sales by year by platform to determine which subset of years from 1980-2016 is best suited to making predictions for 2017. That subset is then used to find which platforms lead in sales each year, which are on the rise, and which are in decline. The same subset is also used to compare global sales across platforms and across individual games, to check whether critic and user scores are correlated with global sales, and to find the leading genres by sales.

### 2a. Games Released by Year

In [26]:
# Calculate the number of games released by year
games_by_year = vg.groupby('year_of_release')['name'].count().reset_index().rename(columns={'name':'frequency'})

display(games_by_year)

Unnamed: 0,year_of_release,frequency
0,1980,9
1,1981,46
2,1982,36
3,1983,17
4,1984,14
5,1985,14
6,1986,21
7,1987,16
8,1988,15
9,1989,17


In [27]:
# Create bar graph of video games released by year
year_bar = px.bar(games_by_year, x='year_of_release', y='frequency',
                  title='Games Released by Year', 
                  labels={'year_of_release':'Year of Release',
                          'frequency':'Number of Games Released'},
                  color_discrete_sequence=[px.colors.qualitative.D3[0]],
                  width=800, height=500)

year_bar.update_layout({
    'plot_bgcolor':'rgba(0, 0, 0, 0)',
    'paper_bgcolor':'rgba(0, 0, 0, 0)'
}) 

year_bar.update_xaxes(showgrid=False)
year_bar.update_yaxes(range=[0, 1800], showgrid=False) 


year_bar.show()


The distribution of games released by year is highly left-skewed: few games were released each year from 1980 to 1995, while the majority of video games were released between 2000 and 2012. From the frequency table, we can see that fewer than 100 games were released per year before 1994. This makes sense, because video game consoles were not common household goods until the late 1990s. The number of games released per year increased from 1990 until 2008. The year with the most releases is 2008 with 1427 games, closely followed by 2009 with 1426. The number of games released per year then generally declined from 2009 until the end of the data in 2016.

### 2b. Total Sales by Year by Platform

In [28]:
# Calculate total sales by platform
sales_by_plat = vg.groupby('platform')['total_sales'].sum().reset_index().sort_values(
    'total_sales', ascending=False).reset_index(drop=True)

print('Total sales by platform from 1980 - 2016:')
display(sales_by_plat)

Total sales by platform from 1980 - 2016:


Unnamed: 0,platform,total_sales
0,ps2,1255.77
1,x360,971.42
2,ps3,939.64
3,wii,907.51
4,ds,806.12
5,ps,730.86
6,gba,317.85
7,ps4,314.14
8,psp,294.05
9,pc,259.52


In [29]:
# Create a bar chart for total sales by platform
plat_bar = px.bar(sales_by_plat, x='platform', y='total_sales', 
                  title='Total Sales by Platform',
                  labels={'total_sales':'Total Sales (USD million)', 'platform':'Platform'},
                  color_discrete_sequence=[px.colors.qualitative.D3[0]],
                  width=800, height=500)

plat_bar.update_layout({
    'plot_bgcolor':'rgba(0, 0, 0, 0)',
    'paper_bgcolor':'rgba(0, 0, 0, 0)'
})

plat_bar.update_layout(xaxis={'categoryorder':'total descending'}) # Order platforms from highest to lowest total sales

plat_bar.update_xaxes(showgrid=False, tickangle=45) # Turn off x grid
plat_bar.update_yaxes(range=[0, 1400], showgrid=False) # Turn off y grid

plat_bar.show()

The sorted bar graph of total sales by platform shows a clear drop-off between the six highest-selling platforms and all the rest. Those six highest-performing platforms are ps2, x360, ps3, wii, ds, and ps. I will use these six top-selling platforms as my focus and further investigate sales per year by platform.

In [30]:
# Create table of total sales by year by platform  
top_platforms = sales_by_plat.iloc[:6, 0] # List of top 6 platforms

vg_top_plat = vg.query('platform in @top_platforms') # Query vg dataframe for only the top platforms

sales_y_p = vg_top_plat.groupby(['year_of_release', 'platform'])['total_sales'].sum().reset_index() # Table of sales by year by platform

print(sales_y_p.to_string())

    year_of_release platform  total_sales
0              1985       ds         0.02
1              1994       ps         6.03
2              1995       ps        35.96
3              1996       ps        94.70
4              1997       ps       136.17
5              1998       ps       169.49
6              1999       ps       144.53
7              2000       ps        96.37
8              2000      ps2        39.17
9              2001       ps        35.59
10             2001      ps2       166.43
11             2002       ps         6.67
12             2002      ps2       205.38
13             2003       ps         2.07
14             2003      ps2       184.31
15             2004       ds        17.27
16             2004      ps2       211.81
17             2005       ds       130.14
18             2005      ps2       160.66
19             2005     x360         8.25
20             2006       ds       119.81
21             2006      ps2       103.42
22             2006      ps3      

In [31]:
# Line chart of total sales by year by platform

# Create line chart
sales_y_p_hist = px.line(sales_y_p, x='year_of_release', y='total_sales', 
                         color='platform',
                         title='Total Sales by Year by Platform', 
                         labels={'total_sales':'Total Sales (USD million)',
                                 'year_of_release':'Year of Release'},
                         color_discrete_sequence=px.colors.qualitative.D3)

sales_y_p_hist.update_layout({
    'plot_bgcolor':'rgba(0, 0, 0, 0)',
    'paper_bgcolor':'rgba(0, 0, 0, 0)'
})

sales_y_p_hist.update_traces(opacity=0.7)

sales_y_p_hist.update_xaxes(showgrid=False)
sales_y_p_hist.update_yaxes(range=[0, 250], showgrid=False)

sales_y_p_hist.show()

The peak and general lifespan of a platform can be read from the grouped table and line chart of total sales by year by platform. These show that ps, ps2, and ds each peaked between the late 1990s and the late 2000s and have zero sales in 2016. The manufacturers likely discontinued these models in favor of newer ones such as the ps3 and 3ds. The wii, x360, and ps3 all peaked in sales around 2010 and, while in decline, still have sales in 2016. Overall, the distribution of sales by year by platform shows that it generally takes about 10 years for a device to come on the market, peak, and then be discontinued.
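The lifespan and peak year of each platform can also be computed rather than read off the chart. This is a minimal sketch on a toy table shaped like `sales_y_p`; the frame `sales` and some of its values are illustrative, and "lifespan" is assumed here to mean the span between the first and last year with recorded sales.

```python
import pandas as pd

# Toy yearly sales table shaped like sales_y_p (illustrative values only)
sales = pd.DataFrame({
    'platform': ['ps', 'ps', 'ps', 'ps2', 'ps2'],
    'year_of_release': [1994, 1998, 2003, 2000, 2011],
    'total_sales': [6.03, 169.49, 2.07, 39.17, 0.45],
})

# First and last year with recorded sales for each platform
span = sales.groupby('platform')['year_of_release'].agg(first_year='min', last_year='max')
span['lifespan_years'] = span['last_year'] - span['first_year'] + 1

# Year in which each platform's sales were highest
peak_rows = sales.loc[sales.groupby('platform')['total_sales'].idxmax()]
peak_year = peak_rows.set_index('platform')['year_of_release']
```

Averaging `lifespan_years` across platforms would give a rough numeric check on the "about 10 years" estimate.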

### 2c. Determine which Years to Use

Based on the above analysis, I argue it is appropriate to use the most recent 5 years (2012-2016) to build a prognosis for 2017. The reasoning is twofold. First, the number of games released each year declined sharply from 2011 to 2012 and has stayed fairly consistent from 2012 to 2016. Second, devices tend to take about 10 years to come on the market, peak, and then decline. Using more than 5 years of data would capture the tail end of popularity for devices that are now off the market. For example, the ds saw $145 USD million in sales in 2008, yet is now off the market. Additionally, the wii saw $171 USD million in sales in 2008 and has since dropped to almost $0 USD million in 2016. Starting the prediction period in 2012 misses the ds and wii boom, yet catches the end of x360 and ps3 and will also catch any platforms that are now on the rise.

### 2d. Find Platforms that are Leading in Sales

In [32]:
# Subset data to include only the last 5 years
vg_5y = vg[vg['year_of_release'] >= 2012]

years_unique = sorted(vg_5y['year_of_release'].unique())

print(f'The data has been subset to include only these years: {years_unique}')

The data has been subset to include only these years: [2012, 2013, 2014, 2015, 2016]


In [33]:
# Calculate platform sales by year
years_platform = vg_5y.pivot_table(index='year_of_release', 
                                   columns='platform', 
                                   values='total_sales', 
                                   aggfunc='sum').reset_index()

display(years_platform)

platform,year_of_release,3ds,ds,pc,ps3,ps4,psp,psv,wii,wiiu,x360,xone
0,2012,51.36,11.01,23.22,107.35,,7.69,16.19,21.71,17.56,99.74,
1,2013,56.57,1.54,12.38,113.25,25.99,3.14,10.59,8.59,21.65,88.58,18.96
2,2014,43.76,,13.28,47.76,100.0,0.24,11.9,3.75,22.03,34.74,54.07
3,2015,27.78,,8.52,16.82,118.9,0.12,6.25,1.14,16.35,11.96,60.14
4,2016,15.14,,5.25,3.6,69.25,,4.25,0.18,4.6,1.52,26.15


In [34]:
# Create line chart of sales by year of release by platform
sales_p_y_5_line = px.line(years_platform, x='year_of_release', 
                          y=['3ds', 'ds', 'pc', 'ps3', 'ps4', 'psp', 'psv', 'wii', 'wiiu', 'x360', 'xone'],
                          title='Total Sales by Platform by Year',
                          labels={'value':'Total Sales (USD million)',
                                  'variable':'Platform',
                                  'year_of_release':'Year of Release'},
                          color_discrete_sequence=px.colors.qualitative.D3,
                          width=1000, height=500)

sales_p_y_5_line.update_layout({
    'plot_bgcolor':'rgba(0, 0, 0, 0)',
    'paper_bgcolor':'rgba(0, 0, 0, 0)'
})

# TODO: update x axis

sales_p_y_5_line.show()

Total video game sales across all platforms generally declined from 2012 through 2016. The ps3 and x360 had relatively large sales in 2012 and 2013 but dropped to almost $0 USD million by 2016. The 3ds declined from its strongest sales in 2012 and 2013 but still has notable sales in 2016 at $15.14 USD million. By contrast, xone and ps4 did not exist in 2012 and have seen substantial growth since their introduction in 2013. Both grew substantially from 2013 to 2014 and again slightly from 2014 to 2015. Both platforms declined in 2016, but sales fell across the whole market that year, and ps4 and xone remained the two best-selling platforms, bringing in $69.25 and $26.15 USD million, respectively. Given that a platform's sales tend to start small and grow each year for about 5 years, I predict that xone and ps4 will continue to be sales leaders in 2017. I would additionally predict that 3ds will stay in the third spot with smaller, but next-largest, sales.
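Year-over-year growth can be quantified with `pct_change`. The sketch below copies the ps4 and xone yearly totals from the pivot table above into a small frame; the names `yearly` and `yoy_pct` are illustrative.

```python
import pandas as pd

# ps4 and xone yearly totals copied from the pivot table above (USD million)
yearly = pd.DataFrame(
    {'ps4': [25.99, 100.0, 118.9, 69.25], 'xone': [18.96, 54.07, 60.14, 26.15]},
    index=[2013, 2014, 2015, 2016],
)

# Year-over-year percent change; the first row has no prior year and is NaN
yoy_pct = (yearly.pct_change() * 100).round(1)
```

Expressing the changes as percentages makes platforms of different sizes directly comparable from one year to the next.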

### 2e. Compare Global Total Sales by Platform

In [35]:
# Calculate average and median sales by platform
avg_platform = vg_5y.pivot_table(index='platform', values='total_sales', aggfunc=['mean', 'median']).round(2) # Pivot table

# Make pivot table into data frame
avg_platform.columns = avg_platform.columns.droplevel(1)

avg_platform.columns.name = None

avg_platform = avg_platform.reset_index().rename(columns={'mean':'mean_sales', 'median':'median_sales'})

display(avg_platform)

Unnamed: 0,platform,mean_sales,median_sales
0,3ds,0.49,0.11
1,ds,0.4,0.05
2,pc,0.25,0.08
3,ps3,0.59,0.2
4,ps4,0.8,0.2
5,psp,0.06,0.03
6,psv,0.12,0.05
7,wii,0.65,0.18
8,wiiu,0.56,0.22
9,x360,0.81,0.31


In [36]:
# Create bar chart comparing the mean and median sales for each platform
plat_mean_bar = px.bar(avg_platform, x='platform', y=['mean_sales', 'median_sales'], barmode='group',
                       title='Mean and Median Total Sales by Platform', 
                       labels={'value':'Total Sales (USD millions)', 'platform':'Platform'},
                       color_discrete_sequence=[px.colors.qualitative.D3[0],
                                                px.colors.qualitative.D3[3]],
                       width=1000, height=500)

plat_mean_bar.update_layout({
    'plot_bgcolor':'rgba(0, 0, 0, 0)',
    'paper_bgcolor':'rgba(0, 0, 0, 0)'
})

plat_mean_bar.update_layout(legend_title_text='Statistic', xaxis={'categoryorder':'total descending'})

plat_mean_bar.show()

In [37]:
# Build a boxplot of sales for individual games by platform
sales_box = px.box(vg_5y, x='platform', y='total_sales', hover_data=['name'],
                   title='Global Sales by Platform',
                   labels={'total_sales':'Total Sales (USD million)', 'platform':'Platform'},
                   color_discrete_sequence=[px.colors.qualitative.D3[0]],
                   width=1000, height=1300)

sales_box.update_layout({
    'plot_bgcolor':'rgba(0, 0, 0, 0)',
    'paper_bgcolor':'rgba(0, 0, 0, 0)'
})

sales_box.update_xaxes(showgrid=False)
sales_box.update_yaxes(range=[0, 23], showgrid=False)


The distribution of game sales by platform is highly right-skewed for all platforms. For each platform, most games bring in $2 USD million or less, but a number of individual games bring in very high sales. The platform with the highest upper outlier boundary is x360 at $2.05 USD million, which is the Q3 + 1.5 × IQR cutoff for that platform. However, the upper outlier boundaries are not representative of the top-selling games on any of these platforms, because all platforms have a large number of upper outliers. This means that while the majority of their games make around $2 USD million or less, they all have a number of games that are high-revenue leaders.
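The outlier boundary described above can be computed by hand. This is a minimal sketch on a toy Series of per-game sales for one platform; the values are illustrative, and quartile conventions differ between libraries, so a hand-computed boundary may not match the plotted boxplot exactly.

```python
import pandas as pd

# Toy per-game sales for one platform (illustrative values only)
sales = pd.Series([0.05, 0.10, 0.20, 0.30, 0.50, 21.05])

# Quartiles with pandas' default linear interpolation
q1, q3 = sales.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# Games above the boundary are the ones drawn as individual outlier points
outliers = sales[sales > upper_fence]
```

Running the same calculation inside a `groupby('platform')` would give one boundary per platform for comparison.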

For example, while the median total sales of games on ps3 is $0.20 USD million, meaning half of ps3 games bring in $0.20 USD million or less, ps3 also has Grand Theft Auto V, which brings in a whopping $21.05 USD million. As another example, wii has median sales of $0.18 USD million, meaning half of wii games bring in $0.18 USD million or less, but its highest-grossing game makes almost 40 times that amount at $7.09 USD million. The same pattern holds for all of the platforms.

Because of these extremely high sellers, each platform's average revenue sits far above its median revenue: the extreme high earners pull the average up. This can be seen in the bar chart displaying median vs average revenue for each platform. The average revenues are about $0.40 million USD higher than the median revenues for each platform. The high skew means that the median is a better statistical measure of center for each platform, but if the platforms wanted to present their sales figures to stakeholders, they would look better presenting the mean.

Comparing the platforms, x360 and ps4 have the highest mean sales, while x360, xone, and wiiu have the highest median sales. When looking at both the mean and the median, there are 6 clear top performers: x360, xone, ps3, ps4, wii, and wiiu. However, when looking at the boxplots, x360, ps3, ps4, and 3ds have the highest grossing individual games by far. Keeping both metrics in mind, I argue the top 5 performing platforms in 2012-2016 were x360, xone, ps3, ps4, and 3ds.
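The Q3 + 1.5 × IQR cutoff and the gap between mean and median discussed above can be reproduced with a few lines of pandas. This is only a sketch: the `toy` frame and the `upper_fence` helper are made up for illustration, and with the notebook's data `vg_5y` would take the place of `toy`.

```python
import pandas as pd

# Hypothetical toy data: one big seller per platform creates right skew
toy = pd.DataFrame({
    'platform': ['a'] * 5 + ['b'] * 5,
    'total_sales': [0.1, 0.2, 0.3, 0.4, 10.0,
                    0.1, 0.1, 0.2, 0.2, 5.0],
})

def upper_fence(sales):
    """Return Q3 + 1.5 * IQR, the upper outlier boundary of a boxplot."""
    q1, q3 = sales.quantile([0.25, 0.75])
    return q3 + 1.5 * (q3 - q1)

# One row per platform: median, mean, and upper outlier boundary
summary = toy.groupby('platform')['total_sales'].agg(
    median='median', mean='mean', upper_fence=upper_fence)

print(summary.round(2))
```

In the toy data, a single large value lifts each mean well above its median, which is the same effect the bar chart shows. Plotly may compute quartiles with a slightly different interpolation than `Series.quantile`, so the boundary shown on hover in the boxplot need not match this calculation exactly.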

### 2f. User and Critic Reviews vs Total Sales

In this section, scatter plots are displayed for user and critic score vs total sales for xone, ps4, and 3ds. The correlation coefficients are then calculated between critic score, user score, and total sales for each platform, and the results are compared.

In [38]:
# Define a function to display scatter plots
def score_scatter(platform, score):
    
    """Take a platform and a score column name and display
    a scatter plot of total sales vs that score for the
    platform."""
    
    vg_5y_x = vg_5y[vg_5y['platform']==platform]
    
    score_scat = px.scatter(vg_5y_x, x=score, y='total_sales', hover_data=['name'],
                       title=f'Total Sales vs {score} for {platform}',
                       labels={'total_sales':f'Total {platform} Sales (USD million)'},
                       color_discrete_sequence=[px.colors.qualitative.D3[0]])

    score_scat.update_layout({
        'plot_bgcolor':'rgba(0, 0, 0, 0)',
        'paper_bgcolor':'rgba(0, 0, 0, 0)'
    })

    score_scat.update_xaxes(showgrid=False)
    score_scat.update_yaxes(showgrid=False)

    score_scat.show()

In [39]:
# Use function to print scatter plots for total sales vs 
# critic score and user score for xone, ps4, and 3ds

score_scatter('xone', 'user_score')

score_scatter('ps4', 'user_score')

score_scatter('3ds', 'user_score')

score_scatter('xone', 'critic_score')

score_scatter('ps4', 'critic_score')

score_scatter('3ds', 'critic_score')

In [40]:
# Define a function for correlations
def score_corr(platform):
    
    """Accept a platform name and display the correlation
    matrix of total_sales, user_score, and critic_score
    for that platform."""
    
    print(f'Correlations for {platform}:')
    
    vg_5y_x = vg_5y[vg_5y['platform']==platform]
    display(vg_5y_x[['total_sales','user_score', 'critic_score']].corr())

In [41]:
# Calculate correlation coefficients for xone
score_corr('xone')

# Calculate correlation coefficients for ps4
score_corr('ps4')

# Calculate correlation coefficients for 3ds
score_corr('3ds')

Correlations for xone:


Unnamed: 0,total_sales,user_score,critic_score
total_sales,1.0,-0.068925,0.416998
user_score,-0.068925,1.0,0.472462
critic_score,0.416998,0.472462,1.0


Correlations for ps4:


Unnamed: 0,total_sales,user_score,critic_score
total_sales,1.0,-0.031957,0.406568
user_score,-0.031957,1.0,0.557654
critic_score,0.406568,0.557654,1.0


Correlations for 3ds:


Unnamed: 0,total_sales,user_score,critic_score
total_sales,1.0,0.197583,0.320803
user_score,0.197583,1.0,0.722762
critic_score,0.320803,0.722762,1.0


The relationship between user score, critic score, and total sales can be seen in both the scatter plots and the correlation coefficients. Based on the scatter plots, user score does not appear to be strongly related to total sales. This is also reflected in the correlation coefficients for Xbox One (r = -0.07), PS4 (r = -0.03), and 3DS (r = 0.20), which all represent a weak to negligible linear relationship between user score and total sales. The correlations for Xbox One and PS4 are negative, but they are so close to zero that they indicate essentially no linear relationship rather than a real tendency for sales to fall as user score rises. The correlation for 3DS is positive, suggesting that sales rise slightly with user score on that platform.

The relationship between critic score and total sales is stronger on all three platforms. From the scatter plots, we can see there is a generally positive relationship between critic scores and sales, meaning that games that receive higher critic scores also tend to bring in higher total sales. This is also reflected in the correlation coefficients for Xbox One (r = 0.42), PS4 (r = 0.41), and 3DS (r = 0.32), which represent a moderate, positive relationship between critic score and total sales on all three platforms.

Based on these plots and correlations, neither user score nor critic score is a particularly strong predictor of total sales. However, comparing the two, critic score is a better predictor of total sales than user score. This means that a company wanting to predict total sales for a new game would be better off basing the prediction on the score the game receives from critics than on the score it receives from users.
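To make the correlation coefficients above more concrete, the sketch below computes Pearson's r both with pandas and directly from its definition. The `toy` data are invented for illustration. The missing value is included to show that pandas uses only the rows where both columns are present.

```python
import numpy as np
import pandas as pd

# Hypothetical scores and sales; NaN marks a game without a critic score
toy = pd.DataFrame({
    'critic_score': [60, 70, 80, 90, np.nan, 85],
    'total_sales': [0.2, 0.5, 0.4, 1.5, 0.3, 0.9],
})

# pandas drops rows where either value is missing before computing r
r_pandas = toy['critic_score'].corr(toy['total_sales'])

# The same coefficient from the definition: cov(x, y) / (sd(x) * sd(y))
complete = toy.dropna()
x = complete['critic_score'] - complete['critic_score'].mean()
y = complete['total_sales'] - complete['total_sales'].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

print(f'r (pandas) = {r_pandas:.3f}, r (manual) = {r_manual:.3f}')
print(f'Share of variance explained: r^2 = {r_pandas ** 2:.2f}')
```

Squaring r gives a rough sense of predictive strength. For the Xbox One critic score correlation of about 0.42, r² is about 0.17, so a linear fit on critic score would account for less than a fifth of the variance in total sales.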

### 2g. Total Sales by Genre

In [42]:
# Calculate mean and median of sales by genre
avg_genre = vg_5y.pivot_table(index='genre', values='total_sales', aggfunc=['mean', 'median']).round(2)

# Make pivot table into data frame
avg_genre.columns = avg_genre.columns.droplevel(1)

avg_genre.columns.name = None

avg_genre = avg_genre.reset_index().rename(columns={'mean':'mean_sales', 'median':'median_sales'})

display(avg_genre)

Unnamed: 0,genre,mean_sales,median_sales
0,Action,0.43,0.12
1,Adventure,0.1,0.03
2,Fighting,0.41,0.13
3,Misc,0.44,0.12
4,Platform,0.72,0.21
5,Puzzle,0.17,0.04
6,Racing,0.47,0.14
7,Role-Playing,0.52,0.14
8,Shooter,1.3,0.44
9,Simulation,0.44,0.12


In [43]:
# Create bar chart of mean and median of sales by genre
genre_mean_bar = px.bar(avg_genre, x='genre', y=['mean_sales', 'median_sales'], barmode='group',
                       title='Mean and Median Total Sales by Genre', 
                       labels={'value':'Total Sales (USD millions)', 'genre':'Genre'},
                       color_discrete_sequence=[px.colors.qualitative.D3[0],
                                                px.colors.qualitative.D3[3]],
                       width=1000, height=500)

genre_mean_bar.update_layout({
    'plot_bgcolor':'rgba(0, 0, 0, 0)',
    'paper_bgcolor':'rgba(0, 0, 0, 0)'
})

genre_mean_bar.update_layout(legend_title_text='Statistic', xaxis={'categoryorder':'total descending'})

genre_mean_bar.show()

In [44]:
# Create a boxplot of sales by genre
sales_genre_box = px.box(vg_5y, x='genre', y='total_sales', hover_data=['name'],
                         title='Total Sales by Genre',
                         labels={'total_sales':'Total Sales (USD millions)', 'genre':'Genre'},
                         color_discrete_sequence=[px.colors.qualitative.D3[0]],
                         width=1000, height=1200)

sales_genre_box.update_layout({
    'plot_bgcolor':'rgba(0, 0, 0, 0)',
    'paper_bgcolor':'rgba(0, 0, 0, 0)'
})

sales_genre_box.update_xaxes(showgrid=False)
sales_genre_box.update_yaxes(range=[0, 25], showgrid=False)

sales_genre_box.show()

The bar graph of mean and median total sales by genre shows that the three genres with the highest mean and median revenue per video game are Shooter, Platform, and Sports. However, the boxplots show that the genre with the single highest-performing game is Action, with Grand Theft Auto V bringing in $21.05 million USD. In general, Action tends to underperform compared to the Shooter, Platform, and Sports genres: its median, third quartile, and upper outlier boundary all fall below the same values for those three genres. If someone were only concerned with the highest-revenue game in a genre, then Action, Shooter, and Role-Playing would come out on top. If they are concerned with the general distribution and center of total sales, then Shooter, Platform, and Sports come out on top.

The three genres with the lowest mean and median revenue per video game are Strategy, Puzzle, and Adventure. Whether we look at the center or the distribution of total sales by game, Strategy, Adventure, and Puzzle come out on the bottom. They have both the lowest means and medians of any genre and the lowest top-selling games, topping out at around $1.5 million USD for the highest-selling game across all three genres.

In general, businesses can expect to get the highest sales from Action, Shooter, Role-Playing, Platform, and Sports genres and the lowest sales from Strategy, Adventure, and Puzzle games.
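The mean and median table used in this section could also be built with a grouped named aggregation, which avoids dropping a column level and renaming afterwards. This is an alternative sketch on invented data, not the notebook's own method. `toy` is a hypothetical stand-in for `vg_5y`.

```python
import pandas as pd

# Hypothetical stand-in for vg_5y
toy = pd.DataFrame({
    'genre': ['Shooter', 'Shooter', 'Puzzle', 'Puzzle', 'Puzzle'],
    'total_sales': [0.4, 2.2, 0.02, 0.04, 0.45],
})

# Named aggregation gives flat, descriptive column names in one step
avg_genre = (toy.groupby('genre')['total_sales']
                .agg(mean_sales='mean', median_sales='median')
                .round(2)
                .sort_values('median_sales', ascending=False)
                .reset_index())

print(avg_genre)
```

Sorting by `median_sales` rather than `mean_sales` ranks genres by their typical game, which matters here because a single blockbuster can inflate a genre's mean.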

### 2h. Descriptive Statistics Conclusion

Based on this analysis, the 5 years from 2012 to 2016 were found to be best suited to predict sales for 2017. These years have a similar number of video games released and are likely to capture sales from platforms that are still on the market. Based on these years, the most popular platforms are currently PS4, 3DS, and Xbox One. Worldwide, the most popular genres are Shooter, Platform, Sports, and Action. Critic score was found to be a better predictor of a game's sales than user score for Xbox One, 3DS, and PS4.

## 3. Create a User Profile for Each Region

In this section, a typical user profile is created for each region, including the most popular platforms and genres by region. Additionally, the correlation between ESRB rating and regional sales is displayed by region.

### 3a. Top 5 Platforms by Region

In [45]:
# Find total sales of each platform in each region
plat_sales_region = vg_5y.pivot_table(index='platform', values=['na_sales', 'eu_sales', 'jp_sales'], aggfunc='sum')

plat_sales_region = plat_sales_region.reset_index()

display(plat_sales_region)

Unnamed: 0,platform,eu_sales,jp_sales,na_sales
0,3ds,42.64,87.79,55.31
1,ds,3.53,3.72,4.59
2,pc,37.76,0.0,19.12
3,ps3,106.85,35.29,103.38
4,ps4,141.09,15.96,108.74
5,psp,0.42,10.47,0.13
6,psv,11.36,21.04,10.98
7,wii,11.92,3.39,17.45
8,wiiu,25.13,13.01,38.1
9,x360,74.52,1.57,140.05


In [46]:
# List top 5 platforms for each region
for col in plat_sales_region.columns[1:4]: # Iterate over each regional sales column in plat_sales_region
    col_df = plat_sales_region[['platform', col]].sort_values(
        by=col, ascending=False).reset_index(drop=True) # Sort platforms by sales, largest to smallest
    
    col_df['percent_sales'] = col_df[col]/(col_df[col].sum()) # Add a column for share of regional sales
    
    print(f'The top 5 platforms for {col}:')
    display(col_df[0:5]) # Display the five largest platforms

The top 5 platforms for eu_sales:


Unnamed: 0,platform,eu_sales,percent_sales
0,ps4,141.09,0.278388
1,ps3,106.85,0.210829
2,x360,74.52,0.147037
3,xone,51.59,0.101794
4,3ds,42.64,0.084134


The top 5 platforms for jp_sales:


Unnamed: 0,platform,jp_sales,percent_sales
0,3ds,87.79,0.455862
1,ps3,35.29,0.183249
2,psv,21.04,0.109253
3,ps4,15.96,0.082875
4,wiiu,13.01,0.067556


The top 5 platforms for na_sales:


Unnamed: 0,platform,na_sales,percent_sales
0,x360,140.05,0.236983
1,ps4,108.74,0.184003
2,ps3,103.38,0.174933
3,xone,93.12,0.157571
4,3ds,55.31,0.093592


In [47]:
# Create a bar graph of sales by platform by region

# Transpose the platform sales by region df
plat_sales_region = vg_5y.pivot_table(index='platform', 
                                      values=['na_sales', 'eu_sales', 'jp_sales'], 
                                      aggfunc='sum') # original pivot table

plat_sales_region_transp = plat_sales_region.transpose().reset_index().rename(columns={'index':'region'}) # transpose & reset index

# Create a stacked bar graph
sales_bar = px.bar(plat_sales_region_transp, x='region', 
                   y=['ps4', 'ps3', '3ds', 'x360', 'xone', 'psv', 
                      'pc', 'wiiu', 'wii', 'psp', 'ds'], 
                    title='Sales by Platform by Region',
                    barmode='stack', labels={'value':'Regional Sales (USD millions)', 
                                             'region':'Region'},
                    color_discrete_sequence=px.colors.qualitative.D3,
                    width=1000, height=500) 
    
sales_bar.update_layout({
    'plot_bgcolor':'rgba(0, 0, 0, 0)',
    'paper_bgcolor':'rgba(0, 0, 0, 0)'
    })

sales_bar.update_layout(legend_title_text='Platform')

sales_bar.update_xaxes(showgrid=False)
sales_bar.update_yaxes(showgrid=False)

sales_bar.show()

<b>The top 5 platforms for each region:</b>

<b>Europe:</b> 
1. ps4 (27.8%)
2. ps3 (21.1%)
3. x360 (14.7%)
4. xone (10.2%)
5. 3ds (8.4%)  

<b>Japan:</b> 
1. 3ds (45.6%)
2. ps3 (18.3%)
3. psv (10.9%)
4. ps4 (8.3%)
5. wiiu (6.8%)

<b>North America:</b> 
1. x360 (23.7%)
2. ps4 (18.4%)
3. ps3 (17.5%)
4. xone (15.8%)
5. 3ds (9.4%)

The ps3, ps4, and 3ds make the list of top 5 platforms in all three regions. In Europe, ps4 and ps3 hold the largest market shares. In North America, x360 leads, followed by ps4 and ps3. In Japan, 3ds has the largest share, taking almost half of the market. Both x360 and xone make the list for Europe and North America but not Japan. Psv and wiiu make the list for Japan but not for Europe or North America.

In general, ps3, ps4, and 3ds are popular across all regions. North American and European consumers are particularly drawn to the Xbox platforms, x360 and xone, while Japanese consumers are less interested in Xbox. Japan's consumers do, however, strongly favor the 3ds, which accounts for almost half (45.6%) of platform sales revenue in Japan.
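The market shares above come from dividing each platform's regional sales by the regional total across all platforms. A reusable sketch of that calculation is shown below. The helper name `top_n_share` and the `toy` data are hypothetical, and the shares are computed before truncating to the top rows so that they are not inflated.

```python
import pandas as pd

def top_n_share(df, key, region_col, n=5):
    """Return the n largest groups in region_col with each group's market share."""
    totals = df.groupby(key)[region_col].sum().sort_values(ascending=False)
    out = totals.to_frame()
    out['percent_sales'] = totals / totals.sum()  # share of ALL groups, not just the top n
    return out.head(n).reset_index()

# Hypothetical per-game sales (USD million)
toy = pd.DataFrame({
    'platform': ['ps4', 'ps4', '3ds', 'x360', 'pc'],
    'jp_sales': [0.5, 0.3, 2.0, 0.15, 0.05],
})

top3 = top_n_share(toy, 'platform', 'jp_sales', n=3)
print(top3)
```

The same helper would work for genres by passing `'genre'` as `key`, assuming the frame has that column.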

### 3b. Top 5 Genres by Region

In [48]:
# Find total sales for each genre in each region
genre_sales_region = vg_5y.pivot_table(index='genre', values=['na_sales', 'eu_sales', 'jp_sales'], aggfunc='sum')

genre_sales_region = genre_sales_region.reset_index()

display(genre_sales_region)

Unnamed: 0,genre,eu_sales,jp_sales,na_sales
0,Action,159.34,52.8,177.84
1,Adventure,9.46,8.24,8.92
2,Fighting,10.79,9.44,19.79
3,Misc,26.32,12.86,38.19
4,Platform,21.41,8.63,25.38
5,Puzzle,1.4,2.14,1.13
6,Racing,27.29,2.5,17.22
7,Role-Playing,48.53,65.44,64.0
8,Shooter,113.47,9.23,144.77
9,Simulation,14.55,10.41,7.97


In [49]:
# List top 5 genres for each region
for col in genre_sales_region.columns[1:4]: # Iterate over each regional sales column in genre_sales_region
    col_df = genre_sales_region[['genre', col]].sort_values(
             by=col, ascending=False).reset_index(drop=True) # Sort genres by sales, largest to smallest
    
    col_df['percent_sales'] = col_df[col]/(col_df[col].sum()) # Add a column for share of regional sales
    
    print(f'The top 5 genres for {col}:')
    display(col_df[0:5])

The top 5 genres for eu_sales:


Unnamed: 0,genre,eu_sales,percent_sales
0,Action,159.34,0.314398
1,Shooter,113.47,0.223891
2,Sports,69.08,0.136304
3,Role-Playing,48.53,0.095756
4,Racing,27.29,0.053847


The top 5 genres for jp_sales:


Unnamed: 0,genre,jp_sales,percent_sales
0,Role-Playing,65.44,0.339807
1,Action,52.8,0.274172
2,Misc,12.86,0.066777
3,Simulation,10.41,0.054055
4,Fighting,9.44,0.049019


The top 5 genres for na_sales:


Unnamed: 0,genre,na_sales,percent_sales
0,Action,177.84,0.300929
1,Shooter,144.77,0.24497
2,Sports,81.53,0.13796
3,Role-Playing,64.0,0.108297
4,Misc,38.19,0.064623


In [50]:
# Create a bar graph of sales by genre by region

# Transpose the genre_sales_region df
genre_sales_region = vg_5y.pivot_table(index='genre', 
                                       values=['na_sales', 'eu_sales', 'jp_sales'], 
                                       aggfunc='sum')
genre_sales_region_transp = genre_sales_region.transpose().reset_index().rename(
                            columns={'index':'region'}) # transpose & reset index

# Create a stacked bar graph
sales_bar = px.bar(genre_sales_region_transp, x='region', 
                   y=['Action', 'Shooter', 'Role-Playing', 'Sports', 'Misc', 'Platform',
                      'Racing', 'Fighting', 'Adventure', 'Simulation', 'Strategy', 'Puzzle'], 
                    title='Sales by Genre by Region', barmode='stack', 
                    labels={'value':'Regional Sales (USD millions)', 'region':'Region'},
                    color_discrete_sequence=px.colors.qualitative.D3,
                    width=1000, height=500)
    
sales_bar.update_layout({
    'plot_bgcolor':'rgba(0, 0, 0, 0)',
    'paper_bgcolor':'rgba(0, 0, 0, 0)'
    })

sales_bar.update_layout(legend_title_text='Genre')

sales_bar.update_xaxes(showgrid=False)
sales_bar.update_yaxes(showgrid=False)

sales_bar.show()

<b>Top 5 Genres for each Region:</b>

<b>Europe:</b>
1. Action (31.4%)  
2. Shooter (22.4%)  
3. Sports (13.6%)  
4. Role-Playing (9.6%)  
5. Racing (5.4%)  
    
<b>Japan:</b> 
1. Role-Playing (34.0%)  
2. Action (27.4%)  
3. Misc (6.7%)
4. Simulation (5.4%)  
5. Fighting (4.9%)  

<b>North America:</b> 
1. Action (30.1%)
2. Shooter (24.5%)
3. Sports (13.8%)
4. Role-Playing (10.8%)
5. Misc (6.5%)  

In all three regions, Action and Role-Playing make the list of top 5 genres. Shooter and Sports make the list in Europe and North America but not in Japan, where Misc, Simulation, and Fighting fill the remaining places. Racing makes the list only in Europe, and Misc also makes the list in North America.

Action and Shooter combined account for over half of the market in Europe and North America. In Japan, Role-Playing and Action together account for about 61%. Both Europe and North America have Sports and Role-Playing coming in third and fourth at around 14% and 10% of market share, respectively. Japan's next highest categories are Misc and Simulation, accounting for about 7% and 5% of market share, respectively.
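The combined shares quoted above can be checked with a cumulative sum over the `percent_sales` values printed in the top 5 tables. The short sketch below hard-codes those printed values for Europe and Japan rather than recomputing them from `vg_5y`.

```python
import pandas as pd

# percent_sales values copied from the top 5 output tables above
eu = pd.Series([0.314398, 0.223891, 0.136304, 0.095756, 0.053847],
               index=['Action', 'Shooter', 'Sports', 'Role-Playing', 'Racing'])
jp = pd.Series([0.339807, 0.274172, 0.066777, 0.054055, 0.049019],
               index=['Role-Playing', 'Action', 'Misc', 'Simulation', 'Fighting'])

for region, shares in {'Europe': eu, 'Japan': jp}.items():
    cumulative = shares.cumsum()
    # Number of leading genres needed to pass half of regional sales
    n_needed = int((cumulative < 0.5).sum()) + 1
    print(f'{region}: top {n_needed} genres cover {cumulative.iloc[n_needed - 1]:.1%}')
```

In both regions two genres are enough to pass the 50% mark, but they are different pairs: Action and Shooter in Europe, Role-Playing and Action in Japan.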

### 3c. Effect of ESRB Rating on Sales by Region

In [51]:
# Create pivot table of total sales by rating by region
sales_rating_region = vg_5y.pivot_table(index='rating',
                            values=['na_sales', 'jp_sales', 'eu_sales'],
                            aggfunc='sum').reset_index()

display(sales_rating_region)

Unnamed: 0,rating,eu_sales,jp_sales,na_sales
0,E,113.02,28.33,114.37
1,E10+,55.37,8.19,75.7
2,M,193.96,21.2,231.57
3,T,52.96,26.02,66.02


In [52]:
# Bar chart of sales by rating
sales_ratings_bar = px.bar(sales_rating_region, x='rating', y=['na_sales', 'eu_sales', 'jp_sales'],
                          barmode='group',
                          title='Sales by Rating by Region',
                          labels={'value':'Total Sales (USD millions)', 'rating':'ESRB Rating'},
                           color_discrete_sequence=px.colors.qualitative.D3)
    
sales_ratings_bar.update_layout({
    'plot_bgcolor':'rgba(0, 0, 0, 0)',
    'paper_bgcolor':'rgba(0, 0, 0, 0)'
    })

sales_ratings_bar.update_layout(legend_title_text='Region', xaxis={'categoryorder':'total descending'})

sales_ratings_bar.update_xaxes(showgrid=False)
sales_ratings_bar.update_yaxes(showgrid=False)


sales_ratings_bar.show()


Sales differ noticeably by ESRB rating in each region. The highest-selling ratings in North America and Europe are M and E, while the highest-selling ratings in Japan are E and T.
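Because the three regions differ a lot in total market size, it can help to compare each rating's share of sales within a region rather than raw totals. The sketch below rebuilds the table from the totals printed above and normalizes each column. It is illustrative only and assumes the four ratings shown are the full set in the pivot table.

```python
import pandas as pd

# Totals copied from the sales_rating_region output above (USD million)
sales_rating_region = pd.DataFrame({
    'rating': ['E', 'E10+', 'M', 'T'],
    'eu_sales': [113.02, 55.37, 193.96, 52.96],
    'jp_sales': [28.33, 8.19, 21.20, 26.02],
    'na_sales': [114.37, 75.70, 231.57, 66.02],
}).set_index('rating')

# Divide each column by its total so every region sums to 1
rating_share = sales_rating_region / sales_rating_region.sum()

print(rating_share.round(3))
print(rating_share.idxmax())  # top-selling rating per region
```

One caveat: if the `rating` column contains missing values, `pivot_table` leaves those games out by default, so these shares describe rated games only.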

### 3d. User Profile Conclusion

The top platforms for Europe are PS4, PS3, and X360. For Japan they are 3DS, PS3, and PSV, and North America prefers X360, PS4, and PS3. The top genres for Europe are Action, Shooter, and Sports. For Japan they are Role-Playing, Action, and Misc, followed by Simulation, and North American users prefer Action, Shooter, and Sports. Sales also vary with ESRB rating across the regions, with M and E ratings most popular in North America and Europe and E and T most popular in Japan.

## 4. Hypothesis Tests

Two formal hypothesis tests are conducted to compare the average user ratings of Xbox One vs PC platforms and Action vs Sports genres.

### 4a. Average User Ratings for Xbox One vs PC platforms

This section uses an independent samples t-test to test whether there is a significant difference between the mean user rating for Xbox One games and the mean user rating for PC games.

#### 4i. Hypotheses

<div style="padding-left: 30px;">
    H<sub>0</sub>: µ<sub>xone</sub> = µ<sub>pc</sub>  The average user rating is the same on Xbox One and PC platforms.
</div>
<div style="padding-left: 30px;">
    H<sub>1</sub>: µ<sub>xone</sub> ≠ µ<sub>pc</sub>  The average user rating is different on Xbox One and PC platforms.
</div>


<div style="padding-left: 30px;">
alpha = 0.05
</div>

#### 4ii. Assumption Check - Equality of Variances

In [53]:
# Check for equality of variances with the levene's test
us_xone_complete = vg_5y['user_score'][vg_5y['platform']=='xone'].dropna() # Drop missing values
us_pc_complete = vg_5y['user_score'][vg_5y['platform']=='pc'].dropna() # Drop missing values

levene = st.levene(us_xone_complete, us_pc_complete, center='mean') # Run levene's test

print(levene) # Print levene's result
print()

print(f'The test statistic is: W = {levene[0]:.2f}.') # Specify W
print(f'The p value is: p = {levene[1]:.4f}.') # Specify p
print()

if levene[1] < 0.05: # State the test decision
    print(f'The p value of {levene[1]:.4f} is less than 0.05. '
          f'We find evidence that the two groups have different variances.')
else:
    print(f'The p value of {levene[1]:.4f} is greater than or equal to 0.05. '
          f'We do not find evidence that the two groups have different variances.')

LeveneResult(statistic=8.613824307015994, pvalue=0.003535667885746799)

The test statistic is: W = 8.61.
The p value is: p = 0.0035.

The p value of 0.0035 is less than 0.05. We find evidence that the two groups have different variances.


The p value for Levene's test for equality of variances is less than 0.05, showing evidence that Xbox One and PC games have different user score variances. Therefore, the t-test is run without assuming equal variances.
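The decision rule used here, where Levene's test determines whether the t-test assumes equal variances, can be sketched on synthetic data as follows. The two groups are randomly generated with the same mean and deliberately different spreads, so they stand in for user scores only in a loose sense.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)

# Synthetic user scores: same mean, clearly different spread
group_a = rng.normal(loc=6.5, scale=1.0, size=200)
group_b = rng.normal(loc=6.5, scale=3.0, size=200)

stat, p = st.levene(group_a, group_b, center='mean')

# Use Welch's t-test (equal_var=False) when Levene's test rejects equal variances
equal_var = p >= 0.05

print(f'W = {stat:.2f}, p = {p:.2e}, equal_var = {equal_var}')
```

Note that `scipy.stats.levene` defaults to `center='median'`, which is generally the more robust choice for skewed data. The notebook's `center='mean'` is kept here for consistency.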

#### 4iii. Independent Samples T-Test - Unequal Variances

In [54]:
# Run the T-Test
t_xone_pc = st.ttest_ind(vg_5y['user_score'][vg_5y['platform']=='xone'],
            vg_5y['user_score'][vg_5y['platform']=='pc'],
            nan_policy='omit', equal_var=False) # omit missing values and don't assume equal variances

print(t_xone_pc) # Print the results
print()

mean_xone = vg_5y['user_score'][vg_5y['platform']=='xone'].mean().round(2) # Calculate mean for Xbox One
mean_pc = vg_5y['user_score'][vg_5y['platform']=='pc'].mean().round(2) # Calculate mean for PC

print(f'Average user rating for Xbox One: {mean_xone}.') # Print means
print(f'Average user rating for PC: {mean_pc}.')
print()

print(f'The test statistic is: t = {t_xone_pc[0]:.2f}') # Specify t
print(f'The p value is: p = {t_xone_pc[1]:.4f}') # Specify p
print()

if t_xone_pc[1] < 0.05:
    print(f'The p value of {t_xone_pc[1]:.4f} is less than 0.05. '
          f'Reject the null hypothesis.')
else:
    print(f'The p value of {t_xone_pc[1]:.4f} is greater than 0.05. '
          f'Do not reject the null hypothesis.')



Ttest_indResult(statistic=0.5998585993590302, pvalue=0.5489537965134987)

Average user rating for Xbox One: 6.52.
Average user rating for PC: 6.43.

The test statistic is: t = 0.60
The p value is: p = 0.5490

The p value of 0.5490 is greater than 0.05. Do not reject the null hypothesis.


#### 4iv. Conclusion

The p value of the independent samples t-test, assuming unequal variances, was greater than the stated significance level of 0.05, so we do not reject the null hypothesis. We do not have evidence that the average user score for Xbox One is different from the average user score for PC. This is consistent with the similar sample means: 6.52 for Xbox One and 6.43 for PC. In other words, we found no evidence that, on average, users rate games played on the Xbox One differently than they rate games played on the PC.
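Calling `ttest_ind` with `equal_var=False` performs Welch's t-test. As a sketch of what that computes, the block below derives the statistic and p value by hand on synthetic samples and compares them with SciPy's result. The sample sizes, means, and spreads are invented.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(1)

# Synthetic samples with unequal sizes and variances
a = rng.normal(6.5, 1.4, size=180)
b = rng.normal(6.4, 1.7, size=250)

# Welch's statistic: difference in means over the unpooled standard error
va, vb = a.var(ddof=1) / a.size, b.var(ddof=1) / b.size
t_manual = (a.mean() - b.mean()) / np.sqrt(va + vb)

# Welch-Satterthwaite degrees of freedom
df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
p_manual = 2 * st.t.sf(abs(t_manual), df)  # two-sided p value

t_scipy, p_scipy = st.ttest_ind(a, b, equal_var=False)

print(f'manual: t = {t_manual:.4f}, p = {p_manual:.4f}')
print(f'scipy:  t = {t_scipy:.4f}, p = {p_scipy:.4f}')
```

Unlike the pooled-variance test, each group contributes its own variance to the standard error, which is why this version is preferred after Levene's test rejects equal variances.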

### 4b. Average User Ratings for Action vs Sports Genres

This section uses another independent samples t-test to test whether the average user rating for Action games is different from the average user rating for Sports games.

#### 4i. Hypotheses

<div style="padding-left: 30px;">
    H<sub>0</sub>: µ<sub>action</sub> = µ<sub>sports</sub>  The average user rating is the same for action and sports genres.
</div>
<div style="padding-left: 30px;">
    H<sub>1</sub>: µ<sub>action</sub> ≠ µ<sub>sports</sub>  The average user rating is different for action and sports genres.
</div>


<div style="padding-left: 30px;">
alpha = 0.05
</div>

#### 4ii. Assumption Check - Equality of Variances

In [55]:
# Check for equality of variances with the levene's test
us_action_complete = vg_5y['user_score'][vg_5y['genre']=='Action'].dropna() # Drop missing values
us_sports_complete = vg_5y['user_score'][vg_5y['genre']=='Sports'].dropna() # Drop missing values

levene_as = st.levene(us_action_complete, us_sports_complete, center='mean') # Run levene's test

print(levene_as) # Print levene's result
print()

print(f'Test statistic: W = {levene_as[0]:.2f}.') # Specify W
print(f'P value: p = {levene_as[1]:.4f}.') # Specify p
print()

if levene_as[1] < 0.05: # State the test decision
    print(f'The p value of {levene_as[1]:.4f} is less than 0.05. '
          f'We find evidence that the two groups have different variances.')
else:
    print(f'The p value of {levene_as[1]:.4f} is greater than or equal to 0.05. '
          f'We do not find evidence that the two groups have different variances.')

LeveneResult(statistic=22.107708632830413, pvalue=3.0935725794271728e-06)

Test statistic: W = 22.11.
P value: p = 0.0000.

The p value of 0.0000 is less than 0.05. We find evidence that the two groups have different variances.


The p value for Levene's test for equality of variances is less than 0.05, showing evidence that the Action and Sports genres have different user score variances. Therefore, the t-test is run without assuming equal variances.

#### 4iii. Independent Samples T-Test - Unequal Variances

In [56]:
# Run the T-Test
t_action_sports = st.ttest_ind(vg_5y['user_score'][vg_5y['genre']=='Action'],
            vg_5y['user_score'][vg_5y['genre']=='Sports'],
            nan_policy='omit', equal_var=False) # omit missing values and don't assume equal variances

print(t_action_sports) # Print the results
print()

mean_action = vg_5y['user_score'][vg_5y['genre']=='Action'].mean().round(2) # Calculate mean for Action
mean_sports = vg_5y['user_score'][vg_5y['genre']=='Sports'].mean().round(2) # Calculate mean for Sports

print(f'Average user rating for Action: {mean_action}.') # Print means
print(f'Average user rating for Sports: {mean_sports}.')
print()

print(f'Test statistic: t = {t_action_sports[0]:.2f}') # Specify t
print(f'P value: p = {t_action_sports[1]:.4f}') # Specify p
print()

if t_action_sports[1] < 0.05: # Rejection decision
    print(f'The p value of {t_action_sports[1]:.4f} is less than 0.05. '
          f'Reject the null hypothesis.')
else:
    print(f'The p value of {t_action_sports[1]:.4f} is greater than 0.05. '
          f'Do not reject the null hypothesis.')

Ttest_indResult(statistic=9.863487132322385, pvalue=5.98945806646755e-20)

Average user rating for Action: 6.83.
Average user rating for Sports: 5.46.

Test statistic: t = 9.86
P value: p = 0.0000

The p value of 0.0000 is less than 0.05. Reject the null hypothesis.


#### 4iv. Conclusion

The p value of the independent samples t-test, assuming unequal variances, was less than the stated significance level of 0.05, so there is evidence that the average user rating for the Action genre is different from the average user rating for the Sports genre. The average user rating of 6.83 for the Action genre is higher than the average user rating of 5.46 for the Sports genre, showing evidence that users rate Action games higher than they rate Sports games, on average. This means we have evidence that users tend to like Action games more than they like Sports games.
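The printed result `p = 0.0000` above is a formatting effect rather than a true zero: the p value is about 6e-20, which rounds to zero at four decimal places. A small sketch of two alternative ways to report it is shown below. The `report` convention is a common one, not something the notebook uses.

```python
# p value copied from the Ttest_indResult output above
p = 5.98945806646755e-20

fixed = f'{p:.4f}'        # fixed-point, as used in the notebook
scientific = f'{p:.2e}'   # scientific notation keeps the magnitude

print(f'p = {fixed}')       # p = 0.0000
print(f'p = {scientific}')  # p = 5.99e-20

# A common reporting convention for very small values
report = 'p < 0.001' if p < 0.001 else f'p = {p:.3f}'
print(report)               # p < 0.001
```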

## 5. Project Conclusion and Business Application

This analysis used data on video game sales from 2012 to 2016 to predict sales patterns for 2017. On a global level, businesses can expect that PS4, Xbox One, and 3DS will continue to be popular platforms and that Action, Shooter, Role-Playing, Platform, and Sports will likely continue to be top-performing genres. Additionally, games in the Grand Theft Auto, Call of Duty, and FIFA series will likely continue to be top-performing games.

By region, PS4 and 3DS should continue to be popular everywhere. Companies marketing to North American or European consumers should additionally push marketing for the X360 and Xbox One platforms. If they intend to market a game in Japan, they should push additional marketing to 3DS, PSV, and Wii U. By genre, businesses can expect that Action, Shooter, Sports, and Role-Playing will be popular in Europe and North America. In Japan, Role-Playing, Action, Simulation, and Fighting will likely continue to be among the top genres.

Both ESRB rating and critic score are related to total sales for games in the three regions. Games with M and E ratings are the highest selling in North America and Europe, and E and T are most popular in Japan. This means that businesses should consider releasing games with M and E ratings in North America and Europe and games with E and T ratings in Japan. Critic score is a better predictor of total sales than user score, meaning companies should pay more attention to the score a new game receives from critics than to the score it receives from users when predicting that game's sales.

According to the statistical tests, companies do not need to worry about which platform users play on: we found no evidence that average user ratings differ between Xbox One and PC. However, they are likely to see a difference in user ratings based on the genre of the game, with higher user ratings for Action games than for Sports games.