<div style="border:solid green 4px; padding: 20px">Hello! My critical comments are highlighted with <span style='color: red;'>red</span>,  less urgent remarks are in <span style='color: #ebd731;'>yellow</span>, recommendations and extra information - in <span style='color: green;'>green</span>.</div>

# Project Description

You work for an online store that sells video games all over the world. User and expert reviews, genres, platforms, and historical data on game sales are available from open sources. You need to identify patterns that determine whether a game succeeds or not. This allows you to put your money on a new item that's potentially hot and plan advertising campaigns.

In front of you is data going back to 2016. Let’s imagine that it’s December 2016 and you’re planning a campaign for 2017. The data set is obtained from ESRB (Entertainment Software Rating Board). The ESRB evaluates a game's content and assigns an appropriate age category, such as Teen or Mature.

# Step 1. Open the data file and study the general information

In [4]:
# import libraries and set options
import os
import io
import numpy as np
import pandas as pd
import nltk
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
from plotly.subplots import make_subplots
import plotly.tools as tls
import plotly.figure_factory as ff
import itertools
import warnings
from scipy import stats as st
from nltk.stem import WordNetLemmatizer

%matplotlib inline
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', None)

<div style="border:solid green 4px; padding: 20px">Your first cell looks pretty conventional :)</div>

In [5]:
# import the dataset
game = pd.read_csv('/datasets/games.csv')

# function for a first glance of the data
def summary(df): 
    eda_df = {}
    eda_df['null_sum'] = df.isnull().sum()
    eda_df['null_perc'] = df.isnull().mean()
    eda_df['dtypes'] = df.dtypes
    eda_df['count'] = df.count()
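    # restrict mean/median to numeric columns; object columns would raise in recent pandas versions
    eda_df['mean'] = df.mean(numeric_only=True)
    eda_df['median'] = df.median(numeric_only=True)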
    eda_df['min'] = df.min()
    eda_df['max'] = df.max()
    return pd.DataFrame(eda_df)

<div style="border:solid green 4px; padding: 20px">Custom overview functions are useful indeed.</div>

In [6]:
game.head()

Unnamed: 0,Name,Platform,Year_of_Release,Genre,NA_sales,EU_sales,JP_sales,Other_sales,Critic_Score,User_Score,Rating
0,Wii Sports,Wii,2006.0,Sports,41.36,28.96,3.77,8.45,76.0,8.0,E
1,Super Mario Bros.,NES,1985.0,Platform,29.08,3.58,6.81,0.77,,,
2,Mario Kart Wii,Wii,2008.0,Racing,15.68,12.76,3.79,3.29,82.0,8.3,E
3,Wii Sports Resort,Wii,2009.0,Sports,15.61,10.93,3.28,2.95,80.0,8.0,E
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,11.27,8.89,10.22,1.0,,,


In [7]:
summary(game)

Unnamed: 0,null_sum,null_perc,dtypes,count,mean,median,min,max
Critic_Score,8578,0.513192,float64,8137,68.967679,71.0,13.0,98.0
EU_sales,0,0.0,float64,16715,0.14506,0.02,0.0,28.96
Genre,2,0.00012,object,16713,,,,
JP_sales,0,0.0,float64,16715,0.077617,0.0,0.0,10.22
NA_sales,0,0.0,float64,16715,0.263377,0.08,0.0,41.36
Name,2,0.00012,object,16713,,,,
Other_sales,0,0.0,float64,16715,0.047342,0.01,0.0,10.57
Platform,0,0.0,object,16715,,,,
Rating,6766,0.404786,object,9949,,,,
User_Score,6701,0.400897,object,10014,,,,


In [8]:
game.describe(include='object').T

Unnamed: 0,count,unique,top,freq
Name,16713,11559,Need for Speed: Most Wanted,12
Platform,16715,31,PS2,2161
Genre,16713,12,Action,3369
User_Score,10014,96,tbd,2424
Rating,9949,8,E,3990


### Conclusion
- Dimension of the data is 16715 by 11
- At first glance, nulls appear in 'name', 'year_of_release', 'genre', 'critic_score', 'user_score', and 'rating'
- We might need to convert 'user_score' to float

# Step 2. Prepare the data
- Replace the column names (make them lowercase).
- Convert the data to the required types.
- Describe the columns where the data types have been changed and why.
- If necessary, decide how to deal with missing values:
    - Explain why you filled in the missing values as you did or why you decided to leave them blank.
    - Why do you think the values are missing? Give possible reasons.
    - Pay attention to the abbreviation TBD (to be determined) in the rating column. Specify how you intend to handle such cases.
- Calculate the total sales (the sum of sales in all regions) for each game and put these values in a separate column.

In [9]:
# replace column names
game.columns = map(str.lower, game.columns)

<div style="border:solid green 4px; padding: 20px">Great thing you know how to use map() now! Although you could just go for <i>game.columns = game.columns.str.lower()</i></div>

In [10]:
# investigate the score and rating columns since they have the highest perc of null values
print(game.user_score.unique(), '\n')
print(game.user_score.value_counts(), '\n')

print(game.critic_score.unique(), '\n')
print(game.critic_score.value_counts(), '\n')

print(game.rating.unique(), '\n')
print(game.rating.value_counts())

['8' nan '8.3' '8.5' '6.6' '8.4' '8.6' '7.7' '6.3' '7.4' '8.2' '9' '7.9'
 '8.1' '8.7' '7.1' '3.4' '5.3' '4.8' '3.2' '8.9' '6.4' '7.8' '7.5' '2.6'
 '7.2' '9.2' '7' '7.3' '4.3' '7.6' '5.7' '5' '9.1' '6.5' 'tbd' '8.8' '6.9'
 '9.4' '6.8' '6.1' '6.7' '5.4' '4' '4.9' '4.5' '9.3' '6.2' '4.2' '6' '3.7'
 '4.1' '5.8' '5.6' '5.5' '4.4' '4.6' '5.9' '3.9' '3.1' '2.9' '5.2' '3.3'
 '4.7' '5.1' '3.5' '2.5' '1.9' '3' '2.7' '2.2' '2' '9.5' '2.1' '3.6' '2.8'
 '1.8' '3.8' '0' '1.6' '9.6' '2.4' '1.7' '1.1' '0.3' '1.5' '0.7' '1.2'
 '2.3' '0.5' '1.3' '0.2' '0.6' '1.4' '0.9' '1' '9.7'] 

tbd    2424
7.8     324
8       290
8.2     282
8.3     254
       ... 
1.5       2
0.7       2
1.9       2
0         1
9.7       1
Name: user_score, Length: 96, dtype: int64 

[76. nan 82. 80. 89. 58. 87. 91. 61. 97. 95. 77. 88. 83. 94. 93. 85. 86.
 98. 96. 90. 84. 73. 74. 78. 92. 71. 72. 68. 62. 49. 67. 81. 66. 56. 79.
 70. 59. 64. 75. 60. 63. 69. 50. 25. 42. 44. 55. 48. 57. 29. 47. 65. 54.
 20. 53. 37. 38. 33. 52. 30. 32. 

In [11]:
# there are two abnormal categorical levels in the user_score field: tbd, and nan
# convert this field to numeric first, this will make things easier when grouping tbd and nan together into one class

game['user_score'] = game['user_score'].astype('category')
game['user_score_new'] = game['user_score'].cat.codes
idx1 = game.query('user_score == "tbd"').index
game.loc[idx1, 'user_score_new'] = -1
game.user_score_new = game.user_score_new.astype('float')

# assign -1 to all NA's in critic_score column
game['critic_score_new'] = game['critic_score'].fillna(-1).astype('float')

del idx1

<div style="border:solid #ebd731; 4px; padding: 20px">You seem to care very much about your memory and namespace management, I am not sure it is that essential 🙂</div>

In [12]:
# now let's focus on the 'year_of_release' column
# before filling NAs with 'no_data', let's make one last effort by extracting the last four characters of the game name
# this is because the last 4 characters of a game name sometimes indicate a year

# target the rows with a missing year
idx = game.index[game['year_of_release'].isna()]

# take the last 4 characters of each name; anything that is not a number becomes NaN
game.loc[idx, 'year_of_release'] = pd.to_numeric(game.loc[idx, 'name'].str[-4:], errors='coerce')
del idx

# use the 'coerce' option to make sure we don't get weird extractions
game['year_of_release'] = pd.to_numeric(game['year_of_release'], errors='coerce')

# now fill in NAs with 'no_data' label
game['year_of_release'] = game['year_of_release'].fillna('no_data')

# note that there is one record that has '500' as its release year, this needs to be replaced with 'no_data'
game['year_of_release'].unique()
game.loc[game['year_of_release'] == 500, 'year_of_release'] = 'no_data'

<div style="border:solid green 4px; padding: 20px">It does not bring us that much in terms of filling records, but this is certainly a great idea.</div>
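The cell above takes the last four characters of a title and later has to clean up a stray value such as 500. A slightly stricter variant is sketched below, assuming we only accept a trailing four-digit number that falls inside a plausible year range. The sample titles, the helper name `year_from_name`, and the 1980-2016 bounds are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical mini-frame; titles and values are made up for illustration
demo = pd.DataFrame({
    'name': ['Madden NFL 2004', 'Space Racer 500', 'Puzzle Quest', None],
    'year_of_release': [None, None, None, 1993.0],
})

def year_from_name(names, lo=1980, hi=2016):
    """Return a trailing 4-digit number from each name if it lies in [lo, hi], else NaN."""
    years = pd.to_numeric(names.str.extract(r'(\d{4})\s*$')[0], errors='coerce')
    return years.where(years.between(lo, hi))

# Only touch the rows where the year is missing
missing = demo['year_of_release'].isna()
demo.loc[missing, 'year_of_release'] = year_from_name(demo.loc[missing, 'name'])
print(demo)
```

With this range check, a title ending in a three-digit number or in an implausible year stays missing, so no manual clean-up step is needed afterwards.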

In [13]:
# treat NA's in the rating column
game['rating'] = game['rating'].fillna('no_data')

# get rid of the old columns
user_score, critic_score = game['user_score'], game['critic_score']   # keep a copy first
game.drop(columns=['user_score','critic_score'], inplace=True)

# get rid of the instances without a name and genre
game.dropna(subset=['name','genre'], inplace=True)
    
# calculate the total sales for each game
game['total_sales'] = game['na_sales'] + game['eu_sales'] + game['jp_sales'] + game['other_sales']

In [14]:
game.head()

Unnamed: 0,name,platform,year_of_release,genre,na_sales,eu_sales,jp_sales,other_sales,rating,user_score_new,critic_score_new,total_sales
0,Wii Sports,Wii,2006,Sports,41.36,28.96,3.77,8.45,E,77.0,76.0,82.54
1,Super Mario Bros.,NES,1985,Platform,29.08,3.58,6.81,0.77,no_data,-1.0,-1.0,40.24
2,Mario Kart Wii,Wii,2008,Racing,15.68,12.76,3.79,3.29,E,80.0,82.0,35.52
3,Wii Sports Resort,Wii,2009,Sports,15.61,10.93,3.28,2.95,E,77.0,80.0,32.77
4,Pokemon Red/Pokemon Blue,GB,1996,Role-Playing,11.27,8.89,10.22,1.0,no_data,-1.0,-1.0,31.38


In [15]:
summary(game)

Unnamed: 0,null_sum,null_perc,dtypes,count,mean,median,min,max
critic_score_new,0,0.0,float64,16713,33.06492,-1.0,-1,98
eu_sales,0,0.0,float64,16713,0.145045,0.02,0,28.96
genre,0,0.0,object,16713,,,Action,Strategy
jp_sales,0,0.0,float64,16713,0.077625,0.0,0,10.22
na_sales,0,0.0,float64,16713,0.263302,0.08,0,41.36
name,0,0.0,object,16713,,,Beyblade Burst,¡Shin Chan Flipa en colores!
other_sales,0,0.0,float64,16713,0.047343,0.01,0,10.57
platform,0,0.0,object,16713,,,2600,XOne
rating,0,0.0,object,16713,,,AO,no_data
total_sales,0,0.0,float64,16713,0.533315,0.17,0,82.54


In [16]:
game.describe(include=['category','object']).T

Unnamed: 0,count,unique,top,freq
name,16713,11559,Need for Speed: Most Wanted,12
platform,16713,31,PS2,2161
year_of_release,16713,38,2008,1429
genre,16713,12,Action,3369
rating,16713,9,no_data,6764


### Conclusion
- The biggest challenge remains the 'user_score' column. 40% of its values are NA. The column was originally imported as object, and it contains 90+ distinct levels. In addition, it has a special class named 'tbd'.
    - The first challenge: if we convert this column entirely to numeric, what value should replace the NAs and 'tbd' cells? We can't fill in 0s, since 0 is a meaningful score and would deflate the ratings of those games.
    - The second challenge: if we keep 'user_score' as object, we can easily encode the NAs and 'tbd's as one group, but how can we effectively reduce the number of unique levels, given that there are 90+ of them?
    - With those two challenges in mind, I decided to convert the 'user_score' column to numeric and encode the NAs and 'tbd's as -1 to represent the lack of data.
    - To do this, I used cat.codes. There are three benefits of using cat.codes. First, all NAs and 'tbd's can be grouped into -1, as mentioned above. Second, category ordinality is preserved. Third, the codes fall in roughly the same 0-100 range as the critic score, although they are ranks of the observed values rather than the scores themselves (a user score of 8 becomes 77).
- After applying cat.codes to the 'user_score' column, I assigned -1 to all NAs in the 'critic_score' column as well, so that it has the same null representation as the user_score column.
- 40% of the values in the 'rating' column are NA; we replace these with 'no_data'. This is fine since the 'rating' column is categorical.
- For the records that have no year info, I tried to take the last 4 characters of the name, because a game sometimes puts the year at the very end of its title to indicate the version. After that, I filled the remaining NAs in the 'year_of_release' column with 'no_data' and treat this column as categorical in the rest of the analysis.
- I deleted the records that have neither a name nor a genre.
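For comparison, here is a minimal sketch of a different way to handle 'tbd', assuming we are willing to keep missing scores as NaN instead of -1. This is not what the notebook does; it only illustrates that `pd.to_numeric` with `errors='coerce'` keeps the real 0-10 values. The sample series and the names `has_score` and `score_100` are made up for the example.

```python
import pandas as pd

# Made-up sample of raw user scores as they appear in the file
raw = pd.Series(['8', '8.3', 'tbd', None, '0'], name='user_score')

# 'tbd' and missing values both become NaN; real scores keep their 0-10 value
score = pd.to_numeric(raw, errors='coerce')

# Optional flag so the "no score yet" group can still be analysed separately
has_score = score.notna()

# Rescale to the 0-100 critic range if the two scores need to be compared directly
score_100 = score * 10

print(pd.DataFrame({'raw': raw, 'score': score,
                    'has_score': has_score, 'score_100': score_100}))
```

The trade-off is that NaN values are skipped by means and correlations, whereas a -1 sentinel takes part in them, so the choice depends on how the column is used later.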

<div style="border:solid green 4px; padding: 20px">Well done.</div>

# Step 3. Analyze the data
- Look at how many games were released in different years. Is the data for every period significant?
- Look at how sales varied from platform to platform. Choose the platforms with the greatest total sales and build a distribution based on data for each year. Find platforms that used to be popular but now have zero sales. How long does it generally take for new platforms to appear and old ones to fade?
- Determine what period you should take data for. To do so, look at your answers to the previous questions. The key criterion is that the data should allow you to build a prognosis for 2017.
- Work only with the data that you've decided is relevant. Disregard the data for previous years.
- Which platforms are leading in sales? Which ones are growing or shrinking? Select several potentially profitable platforms.
- Build a box plot for the global sales of each game, broken down by platform. Are the differences in sales significant? What about average sales on various platforms? Describe your findings.
- Take a look at how user and professional reviews affect sales for a particular popular platform. Build a scatter plot and calculate the correlation between reviews and sales. Draw conclusions.
- Keeping your conclusions in mind, compare the sales of the same games on other platforms.
- Take a look at the general distribution of games by genre. What can we say about the most profitable genres? Can you generalize about genres with high and low sales?

In [17]:
# function for histogram of number of games released in each year
def histogram(df, column, main):
    trace1 = go.Histogram(x = df[column],
                          #histnorm= "percent",
                          name = "Group1",
                          marker = dict(line = dict(width = .5, color = "black")),
                          opacity = .9
                         )
    # this part can be used when building a histogram for two groups
    #trace2 = go.Histogram(x  = game[column],
     #                     histnorm = "percent",
      #                    name = "Group2",
       #                   marker = dict(line = dict(width = .5, color = "black")),
        #                  opacity = .9
         #                )
    data = [trace1]#,trace2]
    layout = go.Layout(dict(title = main,
                            plot_bgcolor  = "rgb(243,243,243)",
                            paper_bgcolor = "rgb(243,243,243)",
                            xaxis = dict(gridcolor = 'rgb(255,255,255)',
                                             title = column,
                                             zerolinewidth = 1,
                                             ticklen = 5,
                                             gridwidth = 2),
                            yaxis = dict(gridcolor = 'rgb(255,255,255)',
                                             title = "Count",
                                             zerolinewidth = 1,
                                             ticklen = 5,
                                             gridwidth = 2),
                           )
                      )
    fig  = go.Figure(data=data,layout=layout)
    py.iplot(fig)

In [18]:
# Look at how many games were released in different years
# Is the data for every period significant

histogram(game, 'year_of_release', 'Number of Games Released in Each Year')

The histogram shows that game production peaked from 2002 to 2011. While we can't test significance for each individual year, we can divide the data into two groups and test whether the average number of games released per year from 2002 to 2011 differs from the average in the remaining years.

In [19]:
# Test whether the average number of games released per year between 2002 and 2011 (inclusive) differs from the other years
# We first need to exclude the records with no info on year_of_release

temp = game.query('year_of_release != "no_data"').copy()   # copy to avoid assigning on a view
temp['year_of_release'] = temp['year_of_release'].astype('float')
temp1 = temp.query('2002<=year_of_release<=2011')
temp1 = temp1.groupby('year_of_release').count()['name']
temp2 = temp.loc[temp.index.difference(temp1.index),]
temp2 = temp2.groupby('year_of_release').count()['name']

# testing for equal variance before carrying out the 2-sample t-test
F = np.var(temp1) / np.var(temp2)
df1,df2 = len(temp1) - 1, len(temp2) - 1
p_value = st.f.cdf(F, df1, df2)
alpha = 0.05

if p_value > alpha:
    print('Failed to reject the null hypothesis, hence Var(ultimate) == Var(surf)', '\n')
else:
    print('Have sufficient significance to reject the null hypothesis, hence Var(ultimate) != Var(surf)', '\n')

# Hypothesis testing - average number of games released per year between 2002 and 2011 (inclusive) vs. the average
# number of games released per year during all other periods

results = st.ttest_ind(temp1, temp2, equal_var=False)
print('p-value:', results.pvalue)
if results.pvalue < alpha:
      print('Have sufficient significance to reject the null hypothesis, hence the average number of game released between the two population is different')
else:
      print('Failed to reject the null hypothesis, hence the average number of game released between the two population is the same')
del temp, temp1, temp2

Have sufficient significance to reject the null hypothesis, hence Var(ultimate) != Var(surf) 

p-value: 3.6510737590801753e-06
Have sufficient significance to reject the null hypothesis, hence the average number of game released between the two population is different


<div style="border:solid green 4px; padding: 20px">You didn't forget to check the variances equality, very good!</div>
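One thing to watch in the split above: after `groupby`, the yearly counts are indexed by year, so taking an index difference against the original row labels may not remove the peak-year rows as intended. A minimal sketch of a mask-based split follows, using randomly generated release years rather than the real data. The names `per_year`, `peak`, and `rest` are illustrative.

```python
import numpy as np
import pandas as pd
from scipy import stats as st

# Made-up release years, purely for illustration
rng = np.random.default_rng(0)
years = pd.Series(rng.integers(1995, 2017, size=5000), name='year_of_release')

# Number of games per year, indexed by year
per_year = years.value_counts().sort_index()

# Split on the year index itself, not on the original row labels
peak_mask = (per_year.index >= 2002) & (per_year.index <= 2011)
peak, rest = per_year[peak_mask], per_year[~peak_mask]

# Welch's t-test does not assume equal variances in the two groups
result = st.ttest_ind(peak, rest, equal_var=False)
print(len(peak), len(rest), result.pvalue)
```

Because both groups come from the same `per_year` series, they are guaranteed to be disjoint and to cover every year exactly once.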

In [20]:
# Look at how sales varied from platform to platform. Then Choose the platforms with the greatest total sales 

# function for data
def single_groupby_bar(df, column, target):
    grouped = df.groupby(column)[target].sum().reset_index()
    tracer = go.Bar(x = grouped[column],
                    y = grouped[target],
                    name = target, marker = dict(line = dict(width = 1,
                                                             color = "#A9A9A9")),
                    text = column+', sales',
                    opacity = .9
                   )
    return tracer

# function for layout
def layout_plot(title, xaxis_lab, yaxis_lab) :
    layout = go.Layout(dict(title = title,
                            plot_bgcolor  = "rgb(243,243,243)",
                            paper_bgcolor = "rgb(243,243,243)",
                            xaxis = dict(gridcolor = 'rgb(255,255,255)',title = xaxis_lab,
                                         zerolinewidth=1,ticklen=5,gridwidth=2),
                            yaxis = dict(gridcolor = 'rgb(255,255,255)',title = yaxis_lab,
                                         zerolinewidth=1,ticklen=5,gridwidth=2),
                           )
                      )
    return layout

trace  = single_groupby_bar(game, 'platform', 'total_sales')
layout = layout_plot("Total Sales by Platform", "Platform", "Total Sales")
fig    = go.Figure(data=trace, layout=layout)
py.iplot(fig)

<div style="border:solid green 4px; padding: 20px">Great use of plotly. For better understanding I recommend sorting this one in descending order.</div>
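The reviewer's suggestion can be sketched as follows, assuming the grouping step is the only thing that changes. The data frame below is a tiny made-up sample with the same column names as the notebook.

```python
import pandas as pd

# Tiny made-up sample in the same shape as the notebook's data
demo = pd.DataFrame({
    'platform': ['PS2', 'Wii', 'PS2', 'DS', 'Wii', 'DS'],
    'total_sales': [1.5, 3.0, 2.5, 0.7, 1.5, 0.3],
})

# Sum sales per platform and order the result from largest to smallest
totals = (demo.groupby('platform')['total_sales']
              .sum()
              .sort_values(ascending=False)
              .reset_index())
print(totals)
```

Passing `totals['platform']` and `totals['total_sales']` as `x` and `y` to `go.Bar` would then draw the bars in descending order.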

The chart shows that PS2 is the most popular platform, with the largest total sales.

In [21]:
# Based on observations, we will choose 'PS2' as our main platform to analyze
# Next, let's build a distribution based on data from that platform for each year

trace  = single_groupby_bar(game.query('platform=="PS2"'), 'year_of_release', 'total_sales')
layout = layout_plot("PS2 Total Sales by Year", "Year", "Total Sales")
fig    = go.Figure(data=trace, layout=layout)
py.iplot(fig)

In [22]:
# Find platforms that used to be popular but now have zero sales. 
# How long does it generally take for new platforms to appear and old ones to fade?

print('Based on the bar chart above, we conclude that PS2 is one of those "used to be popular" platforms that now have very low sales.') 
print('The entire lifecycle of PS2 is 10 years: it took 3-4 years to become popular and about another 7 years to fade out.')

Based on the bar chart above, we conclude that PS2 is one of those "used to be popular" platforms that now have very low sales.
The entire lifecycle of PS2 is 10 years: it took 3-4 years to become popular and about another 7 years to fade out.
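The lifecycle question can also be answered for all platforms at once instead of reading it off one chart. Below is a minimal sketch on made-up data, assuming rows with a 'no_data' year have already been filtered out and the year column is numeric. The column names `first_year`, `last_year`, and `years_active` are illustrative.

```python
import pandas as pd

# Made-up release records, for illustration only
demo = pd.DataFrame({
    'platform': ['PS2', 'PS2', 'PS2', 'DS', 'DS', 'PS4'],
    'year_of_release': [2000, 2005, 2011, 2004, 2013, 2013],
})

# First and last year with a release, per platform
span = demo.groupby('platform')['year_of_release'].agg(first_year='min', last_year='max')
span['years_active'] = span['last_year'] - span['first_year'] + 1

print(span)
print('median years active:', span['years_active'].median())
```

Platforms that are still on the market in the last year of the data have a truncated span, so they should be read with care or excluded before taking the median.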


To determine what period we should take data for, we go back to our first histogram, which shows the number of games released in each year. There were 482 games released in 2001, which is rather similar to the number for 2016.

In addition, the most recent peak happened during 2008 and 2009. Before 2008, the only trend we see is that more and more games were released each year. After 2009, however, the trend reversed: fewer and fewer games were released each year.

With all that said, if our goal is to build a prognosis for 2017 without using predictive modeling, we need to find an earlier year with the same level of game production as 2016 and use it as a threshold to filter out the data from years before it. The remaining portion will be our basis for making a prediction for 2017.

Thus, we conclude that 2001 will be the threshold. This makes sense because the data before 2001 is far away from 2016 and is less representative for predicting 2017.
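The threshold rule described above can be written down explicitly. This is a sketch under stated assumptions: the yearly counts are invented, and the 90% tolerance for "the same level of production" is an arbitrary choice for illustration.

```python
import pandas as pd

# Made-up yearly release counts, for illustration only
counts = pd.Series({1999: 300, 2000: 350, 2001: 480, 2002: 830,
                    2008: 1430, 2015: 600, 2016: 500})

target = counts.loc[2016]
earlier = counts.drop(2016)

# Earliest year whose output reaches at least 90% of the 2016 level
# (the 0.9 tolerance is an assumption, not a value from the analysis)
threshold_year = earlier[earlier >= 0.9 * target].index.min()
print(threshold_year)
```

Changing the tolerance moves the threshold, so it is worth checking how sensitive the chosen period is to that value.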

In [23]:
# Work only with the data that you've decided is relevant. Disregard the data for previous years.

years = list(range(2001, 2017))
# take a copy so that new columns can be added later without a SettingWithCopy warning
game2 = game[game['year_of_release'].isin(years)].copy()

In [24]:
# Which platforms are leading in sales? Which ones are growing or shrinking? Select several potentially profitable platforms.

# To solve this, first we need to create a year range as a new variable, then group the data by both platform and year range.
# Since a total of 16 years remain in the year_of_release column, I will divide it into 4 groups.
# This will help us later in the groupby.

bins = [2001, 2004, 2008, 2012, 2016]
labels = ['01_04', '05_08', '09_12', '13_16']
game2['year_of_release_new'] = pd.cut(game2.year_of_release, bins, labels = labels, include_lowest = True, right = True)
game2.year_of_release_new = game2.year_of_release_new.astype(str)

# Here we will only select the top 5 most popular platforms in terms of having the largest total sales.
pop_plats = game2.groupby('platform')['total_sales'].sum().sort_values(ascending=False).index[:5].tolist()

In [25]:
def double_groupby_bar(df, g1, g2, target, value):
    temp = df.groupby([g1, g2])[target].sum().reset_index()
    tracer = go.Bar(x = temp[temp[g1] == value][g2],
                    y = temp[temp[g1] == value][target],
                    name = value, 
                    marker = dict(line = dict(width = 1)),
                    text = '(year range, sales)'
                   )
    return tracer

layout = layout_plot('Top Platforms Total Sales by Year Range','Year Range','Total Sales')
# one bar trace per top platform
data   = [double_groupby_bar(game2, 'platform', 'year_of_release_new', 'total_sales', p) for p in pop_plats]
fig    = go.Figure(data=data, layout=layout)
py.iplot(fig)

This type of graph isn't particularly useful for the problem we have, so I will switch to a line graph, which shows trends better.

In [26]:
def double_groupby_line(df, g1, g2, target, value):
    temp = df.groupby([g1, g2])[target].sum().reset_index()
    tracer = go.Scatter(x = temp[temp[g1] == value][g2],
                        y = temp[temp[g1] == value][target],
                        name = value, mode = 'lines',
                        marker = dict(line = dict(width = 1)), # color = 'red'),
                        text = '(year, sales)',
                        connectgaps=True
                       )
    return tracer

layout = layout_plot('Top Platforms Total Sales by Year','Year','Total Sales')
# one line trace per top platform
data   = [double_groupby_line(game2, 'platform', 'year_of_release', 'total_sales', p) for p in pop_plats]
fig    = go.Figure(data=data, layout=layout)
py.iplot(fig)

From this graph, we observe that:
- The top 5 most popular platforms after 2000 are PS2, X360, PS3, Wii, and DS.
- PS2 was the most popular platform in the past in terms of total sales.
- DS was popular only from 2005 to 2008; after that it shrank to the point that we don't see any sales for it in 2016.
- Of the five most popular platforms, three are still available in the market: Wii, X360, and PS3. All of them have very low sales in 2016. I suspect one potential reason is that our data only gives us info through early 2016.
- However, if that is not the case, we will need to investigate the successors of those platforms and see how their 2016 sales compare with their predecessors'.
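Whether a platform is growing or shrinking can also be quantified instead of read off the line chart. The sketch below, on made-up numbers, pivots sales into a year-by-platform table and computes the year-over-year relative change. The name `by_year` and the sample values are assumptions for illustration.

```python
import pandas as pd

# Made-up yearly sales for two platforms
demo = pd.DataFrame({
    'platform': ['PS3', 'PS3', 'PS3', 'PS4', 'PS4', 'PS4'],
    'year_of_release': [2014, 2015, 2016, 2014, 2015, 2016],
    'total_sales': [40.0, 20.0, 5.0, 50.0, 100.0, 60.0],
})

# Rows are years, columns are platforms, cells are summed sales
by_year = demo.pivot_table(index='year_of_release', columns='platform',
                           values='total_sales', aggfunc='sum')

# Year-over-year relative change; negative values mean shrinking sales
yoy = by_year.pct_change()
print(yoy.round(2))
```

If the last year in the data is incomplete, its change will look more negative than it really is, so the final row should be interpreted with that in mind.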

<div style="border:solid green 4px; padding: 20px">Very well!</div>

In [27]:
pop_plats = game2.groupby('platform')['total_sales'].sum().sort_values(ascending=False).index[5:10].tolist()
layout = layout_plot('Next Top Platforms Total Sales by Year','Year','Total Sales')
# one line trace per platform ranked 6th to 10th
data   = [double_groupby_line(game2, 'platform', 'year_of_release', 'total_sales', p) for p in pop_plats]
fig    = go.Figure(data=data, layout=layout)
py.iplot(fig)

- The next 5 most popular platforms are PS4, GBA, PSP, 3DS, and XB.
- Recall that PS3, X360 and Wii are historically popular platforms, but they all have very low total sales in 2016. So if our goal is to predict the next popular platform with high future sales, we would naturally pay close attention to the successors of those three platforms.
- Among the next 5 popular platforms, PS4 is the successor of PS3, PSP is a related product in the PlayStation family, and 3DS is related to Wii in the same way.
- As our line graph shows, PS4 and 3DS are our top bets. However, if having only two options is too limited, we can continue down the list and keep looking for other products related to the PS, XBox, and Wii series.

In [28]:
pop_plats = ['PC','XOne','WiiU','PSV']
layout = layout_plot('Next Top Platforms Total Sales by Year','Year','Total Sales')
# one line trace per hand-picked platform
data   = [double_groupby_line(game2, 'platform', 'year_of_release', 'total_sales', p) for p in pop_plats]
fig    = go.Figure(data=data, layout=layout)
py.iplot(fig)

- Based on this observation, let's pick XOne as a leading platform along with PS4 and 3DS.
- Which platforms are leading in sales? PS4, 3DS, and XOne.
- Which platforms are growing or shrinking? I think all platforms are shrinking, since the whole industry has been in a downturn since 2009. A direct consequence is that we see fewer platforms in the market.
- Some historically popular platforms, such as PS2, DS and PSP, have faded due to outdated technology. Other platforms, like PC, have been around for the longest time, but their sales stay relatively low compared to other platforms. There are platforms like WiiU and PSV which are designed to be portable, but they never became mainstream.
- Therefore, in conclusion, the only platforms that have proven to be future-proof are the PS, XBox and Wii series. I would strongly recommend paying close attention to the newest products in these series.

In [29]:
# Build a box plot for the global sales of each game, broken down by platform. 

x_data = ['Wii', 'X360', 'XOne', 'PS4', '3DS']
# add a limit (total_sales < 5) to avoid the scale problem caused by outliers
y_data = [game2.loc[(game2['platform'] == p) & (game2['total_sales'] < 5), 'total_sales'] for p in x_data]
colors = ['rgba(93, 164, 214, 0.5)', 'rgba(255, 144, 14, 0.5)', 'rgba(44, 160, 101, 0.5)', 'rgba(255, 65, 54, 0.5)', 
          'rgba(207, 114, 255, 0.5)']

fig = go.Figure()
for xd, yd, cls in zip(x_data, y_data, colors):
        fig.add_trace(go.Box(
            y=yd,
            name=xd,
            boxpoints='outliers',
            jitter=0.5,
            whiskerwidth=0.2,
            fillcolor=cls,
            marker_size=2,
            line_width=1)
        )

fig.update_layout(
    title='Total Sales of All Games - Broken Down by Platform',
    yaxis=dict(
        autorange=True,
        showgrid=True,
        zeroline=True,
        dtick=5,
        gridcolor='rgb(255, 255, 255)',
        gridwidth=1,
        zerolinecolor='rgb(255, 255, 255)',
        zerolinewidth=2,
    ),
    margin=dict(
        l=40,
        r=30,
        b=80,
        t=100,
    ),
    paper_bgcolor='rgb(243, 243, 243)',
    plot_bgcolor='rgb(243, 243, 243)',
    showlegend=True
)

fig.show()

<div style="border:solid green 4px; padding: 20px">Would be also really fun to add game names as labels, that appear on hover, when selecting a point.</div>

Based on our conclusions for the previous question, I chose 5 popular platforms: three of them are the current leaders and the other two have been historically popular.

In [30]:
# Are the differences in sales significant? What about average sales on various platforms?
# We will use one-way ANOVA to answer the question.

y1 = game2.query('platform==@x_data[0]')['total_sales'] 
y2 = game2.query('platform==@x_data[1]')['total_sales'] 
y3 = game2.query('platform==@x_data[2]')['total_sales'] 
y4 = game2.query('platform==@x_data[3]')['total_sales'] 
y5 = game2.query('platform==@x_data[4]')['total_sales']

st.f_oneway(y1,y2,y3,y4,y5)

F_onewayResult(statistic=1.598502106541611, pvalue=0.17184272805580697)

<div style="border:solid green 4px; padding: 20px">Looks like you are totally comfortable with scipy.stats usage as well as with statistical tests and methods. That's great!</div>

- We have a p-value of 0.1718 and an F-statistic of 1.5985. The F-statistic is too small to be significant at the 5% level (equivalently, the p-value is above 0.05), so we cannot reject the null hypothesis that mean total sales are equal across the 5 selected platforms.
- In other words, the differences in total sales among the popular platforms are not statistically significant. There is no clear winner among X360, Wii, PS3, PS4, 3DS.
- Please note that these groups all have different sizes and variances. With unequal group sizes, the power to detect group differences tracks the harmonic mean of the group sizes rather than the arithmetic mean, which usually implies lower statistical power than expected.
- In addition, the resulting F-statistic is likely to be biased. The direction of the bias depends on whether the larger groups in our data set have the larger or the smaller sample variances.
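
Since the groups differ in size and variance, it can help to check the equal-variance assumption and to run a rank-based test next to the ANOVA. The following is a minimal sketch on synthetic, right-skewed samples (an assumption, not the project data); `st.levene` and `st.kruskal` are standard SciPy functions, while the group sizes and distribution parameters are made up for illustration.

```python
import numpy as np
from scipy import stats as st

# Synthetic stand-ins for per-platform sales (assumed, not the project data):
# right-skewed samples with different sizes.
rng = np.random.default_rng(0)
groups = [
    rng.lognormal(mean=-1.0, sigma=1.0, size=n)
    for n in (400, 250, 120, 90, 60)
]

# Levene's test checks the equal-variance assumption behind classic ANOVA.
levene_stat, levene_p = st.levene(*groups, center='median')

# Kruskal-Wallis compares the groups by ranks, so it is less sensitive
# to skew and unequal variances than the F-test.
kruskal_stat, kruskal_p = st.kruskal(*groups)

# Classic one-way ANOVA, for comparison.
anova_stat, anova_p = st.f_oneway(*groups)

print(f'Levene p-value:         {levene_p:.4f}')
print(f'Kruskal-Wallis p-value: {kruskal_p:.4f}')
print(f'ANOVA p-value:          {anova_p:.4f}')
```

If the ANOVA and the Kruskal-Wallis test agree, the conclusion is less likely to be an artifact of the unequal variances.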

In [31]:
# Build a scatter plot to look at how user and professional reviews affect sales for a particular popular platform. 
# To make better visualization, I will include multiple platforms to the scatterplot then examine.

x_data = ['Wii', 'X360', 'XOne', 'PS4', '3DS']
def plot_scatter(df, column, x, y, value, color):
    # Select the rows for this group once, keeping only points with y < 5,
    # so that the x and y arrays stay the same length and remain aligned.
    subset = df[(df[column] == value) & (df[y] < 5)]
    tracer = go.Scatter(x = subset[x],
                        y = subset[y],
                        mode = "markers",marker = dict(line = dict(color = "black",
                                                                   width = .2),
                                                       size = 4 , color = color,
                                                       symbol = "diamond-dot",
                                                      ),
                        name = value,
                        opacity = .9
                       )
    return tracer

trace1 = plot_scatter(game2, 'platform', 'critic_score_new', 'total_sales', x_data[0], 'green')
trace2 = plot_scatter(game2, 'platform', 'critic_score_new', 'total_sales', x_data[1], 'yellow')
trace3 = plot_scatter(game2, 'platform', 'critic_score_new', 'total_sales', x_data[2], 'red')
trace4 = plot_scatter(game2, 'platform', 'critic_score_new', 'total_sales', x_data[3], 'blue')
trace5 = plot_scatter(game2, 'platform', 'critic_score_new', 'total_sales', x_data[4], 'grey')
data   = [trace1,trace2,trace3,trace4,trace5] 
layout= layout_plot('Critic Score vs. Total Sales by Top Platforms','Critic Score', 'Total Sales')
fig= go.Figure(data= data, layout= layout)
py.iplot(fig)

At first glance this graph may seem intimidating. Click the entries in the legend to isolate the groups you want to inspect:
- Single click will exclude that group
- Double click will choose only that group
- Double click on any to reset

In [32]:
trace1 = plot_scatter(game2, 'platform', 'user_score_new', 'total_sales', x_data[0], 'green')
trace2 = plot_scatter(game2, 'platform', 'user_score_new', 'total_sales', x_data[1], 'yellow')
trace3 = plot_scatter(game2, 'platform', 'user_score_new', 'total_sales', x_data[2], 'red')
trace4 = plot_scatter(game2, 'platform', 'user_score_new', 'total_sales', x_data[3], 'blue')
trace5 = plot_scatter(game2, 'platform', 'user_score_new', 'total_sales', x_data[4], 'grey')
data   = [trace1,trace2,trace3,trace4,trace5] 
layout= layout_plot('User Score vs. Total Sales by Top Platforms','User Score', 'Total Sales')
fig= go.Figure(data= data, layout= layout)
py.iplot(fig)

- Based on the two graphs above, critic scores show a clearer association with sales than user scores, so we will use critic scores as our reference.
- The higher the critic score, the wider the variety of titles we see in the market, and the higher the sales a game can potentially reach.
- This is not to say that the critic score directly causes higher or lower sales. Rather, only a few games receive low critic scores, and those are more likely to sell poorly. The majority of games, which have average critic scores, are more likely to have moderate sales.

<div style="border:solid green 4px; padding: 20px">So, yes, we are not allowed to detect a direct causation, it depends on more than one factor.</div>

In [33]:
# Calculate the correlation between reviews and sales

game2[['critic_score_new', 'total_sales']].corr()

Unnamed: 0,critic_score_new,total_sales
critic_score_new,1.0,0.196746
total_sales,0.196746,1.0
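
A single overall coefficient can hide differences between platforms, and placeholder scores can distort it. The sketch below is an illustration on a made-up frame, not the project data. It assumes that `-1` marks a missing critic score (the notebook uses that convention for `user_score_new`; for `critic_score_new` it is an assumption), drops those rows, and then computes the correlation per platform plus a rank-based (Spearman) coefficient.

```python
import pandas as pd

# Toy frame: column names mirror the notebook, the values are made up.
toy = pd.DataFrame({
    'platform': ['PS4', 'PS4', 'PS4', 'PS4', 'XOne', 'XOne', 'XOne', 'XOne'],
    'critic_score_new': [90, 70, 50, -1, 85, 60, 40, -1],
    'total_sales': [5.0, 2.0, 0.5, 1.0, 3.0, 1.0, 0.2, 0.8],
})

# Assumption: -1 is a placeholder for "no score" and must not enter the correlation.
scored = toy[toy['critic_score_new'] != -1]

# Pearson correlation per platform, on scored games only.
per_platform = {
    name: grp['critic_score_new'].corr(grp['total_sales'])
    for name, grp in scored.groupby('platform')
}

# Spearman correlation is the Pearson correlation of the ranks;
# it is less affected by a handful of blockbuster titles.
overall_spearman = scored['critic_score_new'].rank().corr(scored['total_sales'].rank())

print(per_platform)
print(overall_spearman)
```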


In [34]:
# Keeping your conclusions in mind, compare the sales of the same games on other platforms
# To make the comparison, we first need to separate the data into different dataframes according to platform

all_platforms = []
for each in game2.platform.unique().tolist():
    all_platforms.append(game2.query('platform == @each')[['name','platform','year_of_release','total_sales']])

# find games that appear on more than one platform
# note: duplicated() with the default keep='first' leaves out the first row of each name,
# so the first-listed platform of a game is missing from temp (keep=False would retain it)

temp = game2[game2.duplicated(subset=['name'])].groupby(['year_of_release','name','platform'])['total_sales'].sum().reset_index()

# for those games that appear on multiple platforms, concatenate them into one dataset along with their sales from other platforms
# Then drop the games that only sell on one platform

for i in range(len(all_platforms)):
    all_platforms[i] = all_platforms[i].merge(temp, on='name').sort_values(by='name')
    all_platforms[i] = all_platforms[i][all_platforms[i].duplicated(subset=['name'], keep=False)].sort_values(by='total_sales_x', ascending=False)

- What we have done here is first split the original dataframe into multiple subsets, one per platform.
- Then, among all games, we find the ones that appear on multiple platforms.
- Lastly, each subset is merged with the result of step 2 on the 'name' column, and all unique records are dropped. This way, each final subset contains only games with duplicates.
- To make the comparison easier, we will choose three games that represent the highest, the median and the lowest sales on each of the current leading platforms, then determine whether these games' sales differ on other platforms.
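
The same comparison can also be laid out as one table with a row per game and a column per platform. This is a minimal sketch on a made-up frame (the game names and sales figures are placeholders), not a replacement for the merge-based approach used in this notebook.

```python
import pandas as pd

# Toy data: names and numbers are placeholders for illustration only.
toy = pd.DataFrame({
    'name': ['Game A', 'Game A', 'Game A', 'Game B', 'Game B', 'Game C'],
    'platform': ['PS4', 'XOne', 'PC', 'PS4', 'XOne', 'PS4'],
    'total_sales': [14.6, 7.4, 0.3, 0.6, 0.2, 1.0],
})

# keep=False marks every row of a repeated name, so no platform is left out.
multi = toy[toy.duplicated(subset=['name'], keep=False)]

# One row per game, one column per platform; NaN means "not released there".
wide = multi.pivot_table(index='name', columns='platform',
                         values='total_sales', aggfunc='sum')
print(wide)
```

Reading across a row then shows directly how one title performed on each platform.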

In [35]:
# Current leading platforms: ['XOne', 'PS4', '3DS']

In [36]:
# games chosen from XOne

display(all_platforms[11][all_platforms[11]['total_sales_x'] == all_platforms[11].total_sales_x.max()],'above is the most popular games on XOne')
display(all_platforms[11][all_platforms[11]['total_sales_x'] == all_platforms[11].total_sales_x.median()],'above is medium popular games on XOne')
display(all_platforms[11][all_platforms[11]['total_sales_x'] == all_platforms[11].total_sales_x.min()],'above is the least popular games on XOne')

Unnamed: 0,name,platform_x,year_of_release_x,total_sales_x,year_of_release_y,platform_y,total_sales_y
0,Call of Duty: Black Ops 3,XOne,2015,7.39,2015.0,PC,0.26
3,Call of Duty: Black Ops 3,XOne,2015,7.39,2015.0,XOne,7.39
1,Call of Duty: Black Ops 3,XOne,2015,7.39,2015.0,PS3,1.69
2,Call of Duty: Black Ops 3,XOne,2015,7.39,2015.0,X360,1.7


'above is the most popular games on XOne'

Unnamed: 0,name,platform_x,year_of_release_x,total_sales_x,year_of_release_y,platform_y,total_sales_y
307,Just Dance 2016,XOne,2015,0.32,2015.0,X360,0.31
308,Just Dance 2016,XOne,2015,0.32,2015.0,XOne,0.32
306,Just Dance 2016,XOne,2015,0.32,2015.0,WiiU,0.57
309,Dragon Ball: XenoVerse,XOne,2015,0.32,2015.0,PS3,0.49
310,Dragon Ball: XenoVerse,XOne,2015,0.32,2015.0,X360,0.23
311,Dragon Ball: XenoVerse,XOne,2015,0.32,2015.0,XOne,0.32
304,Just Dance 2016,XOne,2015,0.32,2015.0,PS3,0.19
305,Just Dance 2016,XOne,2015,0.32,2015.0,PS4,0.36


'above is medium popular games on XOne'

Unnamed: 0,name,platform_x,year_of_release_x,total_sales_x,year_of_release_y,platform_y,total_sales_y
631,Assassin's Creed Chronicles,XOne,2016,0.01,2016.0,PSV,0.07
632,Assassin's Creed Chronicles,XOne,2016,0.01,2016.0,XOne,0.01
617,WRC 5: FIA World Rally Championship,XOne,2015,0.01,2015.0,PSV,0.03
598,Arslan: The Warriors of Legend,XOne,2016,0.01,2015.0,PS3,0.05
630,Rugby League Live 3,XOne,2015,0.01,2015.0,XOne,0.01
599,Arslan: The Warriors of Legend,XOne,2016,0.01,2016.0,XOne,0.01
640,ZombiU,XOne,2016,0.01,2016.0,PS4,0.06
629,Rugby League Live 3,XOne,2015,0.01,2015.0,X360,0.02
637,Rugby Challenge 3,XOne,2016,0.01,2016.0,PS3,0.02
628,Rugby League Live 3,XOne,2015,0.01,2015.0,PS3,0.01


'above is the least popular games on XOne'

<div style="border:solid green 4px; padding: 20px">Good thing you are using display() instead of print().</div>

In [37]:
# games chosen from PS4

display(all_platforms[6][all_platforms[6]['total_sales_x'] == all_platforms[6].total_sales_x.max()],'above is the most popular games on PS4')
display(all_platforms[6][all_platforms[6]['total_sales_x'] == all_platforms[6].total_sales_x.median()],'above is medium popular games on PS4')
display(all_platforms[6][all_platforms[6]['total_sales_x'] == all_platforms[6].total_sales_x.min()],'above is the least popular games on PS4')

Unnamed: 0,name,platform_x,year_of_release_x,total_sales_x,year_of_release_y,platform_y,total_sales_y
1,Call of Duty: Black Ops 3,PS4,2015,14.63,2015.0,PS3,1.69
3,Call of Duty: Black Ops 3,PS4,2015,14.63,2015.0,XOne,7.39
2,Call of Duty: Black Ops 3,PS4,2015,14.63,2015.0,X360,1.7
0,Call of Duty: Black Ops 3,PS4,2015,14.63,2015.0,PC,0.26


'above is the most popular games on PS4'

Unnamed: 0,name,platform_x,year_of_release_x,total_sales_x,year_of_release_y,platform_y,total_sales_y
351,The Amazing Spider-Man 2 (2014),PS4,2014,0.56,2014.0,PS3,0.26
353,The Amazing Spider-Man 2 (2014),PS4,2014,0.56,2014.0,X360,0.23
349,The Amazing Spider-Man 2 (2014),PS4,2014,0.56,2014.0,3DS,0.07
350,The Amazing Spider-Man 2 (2014),PS4,2014,0.56,2014.0,PC,0.01
352,The Amazing Spider-Man 2 (2014),PS4,2014,0.56,2014.0,WiiU,0.05
354,The Amazing Spider-Man 2 (2014),PS4,2014,0.56,2014.0,XOne,0.22


'above is medium popular games on PS4'

Unnamed: 0,name,platform_x,year_of_release_x,total_sales_x,year_of_release_y,platform_y,total_sales_y
777,Chaos;Child,PS4,2015,0.01,2015.0,PS4,0.01
776,Chaos;Child,PS4,2015,0.01,2015.0,PS3,0.01


'above is the least popular games on PS4'

In [38]:
# games chosen from 3DS

display(all_platforms[7][all_platforms[7]['total_sales_x'] == all_platforms[7].total_sales_x.max()],'above is the most popular games on 3DS')
display(all_platforms[7][all_platforms[7]['total_sales_x'] == all_platforms[7].total_sales_x.median()],'above is medium popular games on 3DS')
display(all_platforms[7][all_platforms[7]['total_sales_x'] == all_platforms[7].total_sales_x.min()],'above is the least popular games on 3DS')

Unnamed: 0,name,platform_x,year_of_release_x,total_sales_x,year_of_release_y,platform_y,total_sales_y
1,Monster Hunter Tri,3DS,2011,2.78,2009.0,Wii,2.21
2,Monster Hunter Tri,3DS,2011,2.78,2012.0,WiiU,0.69


'above is the most popular games on 3DS'

Unnamed: 0,name,platform_x,year_of_release_x,total_sales_x,year_of_release_y,platform_y,total_sales_y
185,The Amazing Spider-Man (Console Version),3DS,2012,0.19,2012.0,DS,0.19
184,The Amazing Spider-Man (Console Version),3DS,2012,0.19,2012.0,3DS,0.19
186,The Amazing Spider-Man (Console Version),3DS,2012,0.19,2012.0,Wii,0.15
205,Pro Evolution Soccer 2014,3DS,2013,0.19,2013.0,X360,0.24
204,Pro Evolution Soccer 2014,3DS,2013,0.19,2013.0,PSP,0.16
187,The Amazing Spider-Man (Console Version),3DS,2012,0.19,2012.0,X360,0.73
202,Pro Evolution Soccer 2014,3DS,2013,0.19,2013.0,3DS,0.19
203,Pro Evolution Soccer 2014,3DS,2013,0.19,2013.0,PC,0.11
200,NASCAR Unleashed,3DS,2011,0.19,2011.0,Wii,0.18
199,NASCAR Unleashed,3DS,2011,0.19,2011.0,PS3,0.1


'above is medium popular games on 3DS'

Unnamed: 0,name,platform_x,year_of_release_x,total_sales_x,year_of_release_y,platform_y,total_sales_y
389,Turbo: Super Stunt Squad,3DS,2013,0.01,2013.0,PS3,0.01
388,Turbo: Super Stunt Squad,3DS,2013,0.01,2013.0,3DS,0.01
392,Turbo: Super Stunt Squad,3DS,2013,0.01,2013.0,X360,0.01
391,Turbo: Super Stunt Squad,3DS,2013,0.01,2013.0,WiiU,0.02
397,Kiniro no Corda 3,3DS,2015,0.01,2015.0,3DS,0.01
390,Turbo: Super Stunt Squad,3DS,2013,0.01,2013.0,Wii,0.01
396,Kiniro no Corda 3,3DS,2015,0.01,2010.0,PS2,0.03


'above is the least popular games on 3DS'

In [39]:
# Take a look at the general distribution of games by genre.  

histogram(game2, 'genre', 'Number of Games Released within Each Genre')

<div style="border:solid green 4px; padding: 20px">Sorted barchart is always better than unsorted one.</div>

In [40]:
# What can we say about the most profitable genres?

trace  = single_groupby_bar(game, 'genre', 'total_sales')
layout = layout_plot('Total Sales by Genre', 'Genre', 'Total Sales')
fig    = go.Figure(data=trace, layout=layout)
py.iplot(fig)

Based on these observations, we conclude that:
- The 'Action' and 'Sports' genres are the most profitable, both in number of games available and in total sales.
- By the same logic, 'Strategy' and 'Puzzle' are the least profitable genres.
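
Total sales grow with the number of games released, so a genre can lead on totals while a typical game in it sells modestly. The sketch below, on made-up numbers, puts the game count, total sales and median sales per game side by side; the column names follow the notebook and the values are assumptions.

```python
import pandas as pd

# Toy data: genre labels follow the notebook, sales values are made up.
toy = pd.DataFrame({
    'genre': ['Action'] * 5 + ['Shooter'] * 2 + ['Puzzle'] * 2,
    'total_sales': [0.5, 0.4, 0.3, 0.8, 1.0, 1.5, 1.1, 0.1, 0.2],
})

# Count, total and median per genre in one pass.
by_genre = (
    toy.groupby('genre')['total_sales']
       .agg(games='count', total='sum', median='median')
       .sort_values('total', ascending=False)
)
print(by_genre)
```

In this toy example 'Action' has the largest total, while 'Shooter' has the highest median sales per game.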

### Conclusion
- Our data show that the gaming industry experienced a production peak from 2002 to 2011. Since then, all platforms have been shrinking, to the point where only a few major players remain in the market: the PS, Xbox and Wii series.
- With that said, the current market leaders are XOne, PS4, and 3DS.
- Some historically popular platforms, such as PS2, DS and PSP, have faded due to outdated technology. Others, like PC, have been around for the longest period, but their sales stay relatively low compared to other platforms. There are also platforms like WiiU and PSV, which are designed to be portable but never became mainstream.
- Taking PS2 as an example, since it is the most successful platform in history, the entire lifecycle of a platform, from launch to maturity to obsolescence, is about 10 years. It takes 2 years for a platform to mature, it then stays popular for another 3 years, and finally it takes 5 years to slowly fade out.
- It's observed that the higher the critic score, the wider the variety of titles available in the market and thus the higher the sales a game could potentially achieve. This is not to say that the critic score directly causes higher or lower sales.
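
The lifecycle estimate can be checked for any platform by looking at its first year, last year and peak year of sales. This is a minimal sketch on made-up yearly figures; the column names follow the notebook, and the numbers are assumptions chosen only to illustrate the calculation.

```python
import pandas as pd

# Toy data: yearly sales per platform, values are made up.
toy = pd.DataFrame({
    'platform': ['PS2', 'PS2', 'PS2', 'PS2', 'PS4', 'PS4', 'PS4'],
    'year_of_release': [2000, 2004, 2008, 2010, 2013, 2015, 2016],
    'total_sales': [40.0, 210.0, 50.0, 5.0, 25.0, 120.0, 70.0],
})

# First and last year with releases, and the span between them.
lifecycle = toy.groupby('platform')['year_of_release'].agg(
    first_year='min', last_year='max')
lifecycle['years_active'] = lifecycle['last_year'] - lifecycle['first_year'] + 1

# Year with the highest total sales for each platform.
yearly = toy.groupby(['platform', 'year_of_release'])['total_sales'].sum()
lifecycle['peak_year'] = yearly.unstack('platform').idxmax()

print(lifecycle)
```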

# Step 4. Create a user profile for each region
For each region (NA, EU, JP), determine:
- The top five platforms. Describe variations in their market shares from region to region.
- The top five genres. Explain the difference.
- Do ESRB ratings affect sales in individual regions?

In [41]:
# For each region (NA, EU, JP), find the top five platforms

fig = make_subplots(1, 3, specs=[[{'type':'domain'}, {'type':'domain'}, {'type':'domain'}]],
                    subplot_titles=['North America', 'Europe', 'Japan'])

fig.add_trace(go.Pie(labels=game2.groupby('platform')['na_sales'].sum().sort_values(ascending=False).keys().tolist()[:5], 
                     values=game2.groupby('platform')['na_sales'].sum().sort_values(ascending=False).values.tolist()[:5], 
                     scalegroup='one', 
                     name="na_sales"
                    ), 1, 1)

fig.add_trace(go.Pie(labels=game2.groupby('platform')['eu_sales'].sum().sort_values(ascending=False).keys().tolist()[:5], 
                     values=game2.groupby('platform')['eu_sales'].sum().sort_values(ascending=False).values.tolist()[:5], 
                     scalegroup='one',
                     name="eu_sales"
                    ), 1, 2)

fig.add_trace(go.Pie(labels=game2.groupby('platform')['jp_sales'].sum().sort_values(ascending=False).keys().tolist()[:5], 
                     values=game2.groupby('platform')['jp_sales'].sum().sort_values(ascending=False).values.tolist()[:5],   
                     scalegroup='one',
                     name="jp_sales"
                    ), 1, 3)

fig.update_layout(title_text='Top 5 Platforms for Major Regions')
fig.show()

Describe variations in their market shares from region to region. Please note that the following description uses data from all years after 2000, so the pie charts reflect the entire history, not the current market share.
- From the pie charts above, North America is the largest market, followed by Europe.
- Historically, Wii, PS2, and X360 are the three most popular platforms. Interestingly, the Xbox series did not make it into the top 5 platforms in Japan, and in the European market the PS series occupies almost half of the market.
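
Because each pie chart only contains the top five platforms, the slices show shares among those five rather than shares of the whole regional market. A minimal sketch of computing shares against the full regional total, on made-up numbers (the column names follow the notebook, the values are assumptions):

```python
import pandas as pd

# Toy data: sales per game by region, values are made up.
toy = pd.DataFrame({
    'platform': ['PS4', 'XOne', '3DS', 'PS4', 'XOne', '3DS'],
    'na_sales': [3.0, 4.0, 1.0, 2.0, 1.0, 1.0],
    'eu_sales': [4.0, 2.0, 1.0, 3.0, 1.0, 1.0],
    'jp_sales': [1.0, 0.0, 3.0, 1.0, 0.0, 3.0],
})
regions = ['na_sales', 'eu_sales', 'jp_sales']

# Sales per platform and region, then each platform's share of the
# full regional total, in percent.
totals = toy.groupby('platform')[regions].sum()
shares = totals / totals.sum() * 100

# Top platforms in one region, ranked by share.
top_jp = shares['jp_sales'].nlargest(5)
print(shares.round(1))
print(top_jp.round(1))
```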

<div style="border:solid green 4px; padding: 20px">Very good. We can see how unique Japanese users' tastes are.</div>

In [42]:
# For each region (NA, EU, JP), find the top five genres.

fig = make_subplots(1, 3, specs=[[{'type':'domain'}, {'type':'domain'}, {'type':'domain'}]],
                    subplot_titles=['North America', 'Europe', 'Japan'])

fig.add_trace(go.Pie(labels=game2.groupby('genre')['na_sales'].sum().sort_values(ascending=False).keys().tolist()[:5], 
                     values=game2.groupby('genre')['na_sales'].sum().sort_values(ascending=False).values.tolist()[:5], 
                     scalegroup='one', 
                     name="na_sales"
                    ), 1, 1)

fig.add_trace(go.Pie(labels=game2.groupby('genre')['eu_sales'].sum().sort_values(ascending=False).keys().tolist()[:5], 
                     values=game2.groupby('genre')['eu_sales'].sum().sort_values(ascending=False).values.tolist()[:5], 
                     scalegroup='one',
                     name="eu_sales"
                    ), 1, 2)

fig.add_trace(go.Pie(labels=game2.groupby('genre')['jp_sales'].sum().sort_values(ascending=False).keys().tolist()[:5], 
                     values=game2.groupby('genre')['jp_sales'].sum().sort_values(ascending=False).values.tolist()[:5],   
                     scalegroup='one',
                     name="jp_sales"
                    ), 1, 3)

fig.update_layout(title_text='Top 5 Genres for Major Regions')
fig.show()

- Sports and Action games are very popular in all three major markets.
- In the Japanese market, role-playing games are particularly popular, occupying 40% of the market share.
- The Shooter genre is popular in the North American and European markets; both show a 20% market share.

In [43]:
# Do ESRB ratings affect sales in individual regions?

trace1  = single_groupby_bar(game2, 'rating', 'na_sales')
trace2  = single_groupby_bar(game2, 'rating', 'eu_sales')
trace3  = single_groupby_bar(game2, 'rating', 'jp_sales')
data   = [trace1,trace2,trace3]
layout = layout_plot("Regional Sales by Rating", "Rating", "Sales")
fig    = go.Figure(data=data, layout=layout)
py.iplot(fig)

Clearly, games rated E have the most potential to reach large sales. Please note that this does not imply that the rating has a direct impact on sales. One potential explanation is that there are more games with that rating, so their cumulative sales are larger than those of other ratings.
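
One way to separate the effect of the rating from the number of games carrying it is to compare average sales per game next to the totals. The sketch below uses made-up values; the 'Unrated' label for games without an ESRB rating is an assumption introduced here for illustration.

```python
import pandas as pd

# Toy data: ratings and sales are made up; None stands for a missing rating.
toy = pd.DataFrame({
    'rating': ['E', 'E', 'E', 'E', 'E', 'M', 'M', None, None],
    'na_sales': [0.5, 0.4, 0.6, 0.5, 0.5, 1.2, 0.8, 0.1, 0.3],
})

# Assumption: keep games without a rating under an explicit label
# instead of letting groupby drop them silently.
labelled = toy.assign(rating=toy['rating'].fillna('Unrated'))

# Game count, total sales and mean sales per game for each rating.
summary_by_rating = labelled.groupby('rating')['na_sales'].agg(
    games='count', total='sum', mean='mean')
print(summary_by_rating)
```

In this toy example 'E' has the largest total only because it has the most games, while 'M' sells more per game.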

# Step 5. Test the following hypotheses:
- Average user ratings of the Xbox One and PC platforms are the same.
- Average user ratings for the Action and Sports genres are different.

Testing for user ratings of Xbox One and PC platforms
- Null: the average user ratings of Xbox One and PC are the same
- Alternative: the average user ratings of Xbox One and PC are not the same

In [44]:
# test for equal variance
# note: st.f.cdf gives the lower-tail probability, so this is a one-sided check

temp1 = game2.query('platform == "XOne" and user_score_new != -1')['user_score_new']
temp2 = game2.query('platform == "PC" and user_score_new != -1')['user_score_new']

F = np.var(temp1) / np.var(temp2)
df1,df2 = len(temp1) - 1, len(temp2) - 1
p_value = st.f.cdf(F, df1, df2)
alpha = 0.05

if p_value > alpha:
    print('Failed to reject the null hypothesis, hence Var(XOne) == Var(PC)', '\n')
else:
    print('Have sufficient significance to reject the null hypothesis, hence Var(XOne) != Var(PC)', '\n')

# hypothesis testing - average user score of the two platforms

results = st.ttest_ind(temp1, temp2, equal_var=True)
print('p-value:', results.pvalue)
if results.pvalue < alpha:
    print('Have sufficient significance to reject the null hypothesis, hence the avg user scores of the two populations are different.')
else:
    print('Failed to reject the null hypothesis, hence the avg user scores of the two populations are the same.')
del temp1, temp2

Failed to reject the null hypothesis, hence Var(XOne) == Var(PC) 

p-value: 5.195035813817854e-05
Have sufficient significance to reject the null hypothesis, hence the avg user scores of the two populations are different.


Testing average user ratings for the Action and Sports genres 
- Null: the average user ratings for Action genre are the same as the average ratings for Sports genre.
- Alternative: the average user ratings for Action genre are different than the average ratings for Sports genre.

In [45]:
# test for equal variance
# note: st.f.cdf gives the lower-tail probability, so this is a one-sided check

temp1 = game2.query('genre == "Action" and user_score_new != -1')['user_score_new']
temp2 = game2.query('genre == "Sports" and user_score_new != -1')['user_score_new']

F = np.var(temp1) / np.var(temp2)
df1,df2 = len(temp1) - 1, len(temp2) - 1
p_value = st.f.cdf(F, df1, df2)
alpha = 0.05

if p_value > alpha:
    print('Failed to reject the null hypothesis, hence Var(Action) == Var(Sports)', '\n')
else:
    print('Have sufficient significance to reject the null hypothesis, hence Var(Action) != Var(Sports)', '\n')

# hypothesis testing - average user score of the two genres

results = st.ttest_ind(temp1, temp2, equal_var=False)
print('p-value:', results.pvalue)
if results.pvalue < alpha:
    print('Have sufficient significance to reject the null hypothesis, hence the avg user scores of the two populations are different.')
else:
    print('Failed to reject the null hypothesis, hence the avg user scores of the two populations are the same.')
del temp1, temp2

Have sufficient significance to reject the null hypothesis, hence Var(Action) != Var(Sports) 

p-value: 0.10230892151264945
Failed to reject the null hypothesis, hence the avg user scores of the two populations are the same.
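
The variance check in the two cells above uses the lower tail of the F distribution only, and the F-test itself is sensitive to non-normal data. As a hedged alternative, the sketch below runs Levene's test and a two-sided F-test on synthetic score samples (assumed values, not the project data) and lets the result choose between the pooled and the Welch t-test.

```python
import numpy as np
from scipy import stats as st

# Synthetic user-score samples (assumed, not the project data).
rng = np.random.default_rng(1)
scores_a = rng.normal(loc=6.5, scale=1.4, size=180)
scores_b = rng.normal(loc=7.0, scale=1.5, size=750)
alpha = 0.05

# Levene's test: the null hypothesis is that both groups have equal variance.
levene_stat, levene_p = st.levene(scores_a, scores_b, center='median')
equal_var = bool(levene_p > alpha)

# Two-sided F-test on sample variances (ddof=1), shown for comparison.
f_ratio = np.var(scores_a, ddof=1) / np.var(scores_b, ddof=1)
dfa, dfb = len(scores_a) - 1, len(scores_b) - 1
lower_tail = st.f.cdf(f_ratio, dfa, dfb)
f_p_two_sided = 2 * min(lower_tail, 1 - lower_tail)

# Pooled t-test if the variances look equal, Welch's t-test otherwise.
result = st.ttest_ind(scores_a, scores_b, equal_var=equal_var)

print('Levene p-value:', levene_p)
print('Two-sided F-test p-value:', f_p_two_sided)
print('equal_var used:', equal_var)
print('t-test p-value:', result.pvalue)
```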


<div style="border:solid green 4px; padding: 20px">Both hypotheses pairs are formulated correctly. Variances equalities are also checked well.</div>

# Step 6. Write a general conclusion
In conclusion, the game console industry has been shrinking since 2009 in terms of both sales volume and game production. However, the PS, Wii and Xbox series have proven very popular throughout history, to the point that they still dominate the market today. North America is the single largest market in terms of sales volume, followed by Europe. The North American and European markets share many similarities in their popular game genres.

It's also observed that games are likely to have lower sales if their critic and user scores are low. However, please note that the critic score alone is not likely to have a direct impact on a game's sales volume. In other words, association does not imply causation. The same logic also applies to ratings.

Although there is no clear winner among the top three platform families (Wii, PS, Xbox), one may reasonably predict that these three are likely to coexist for the foreseeable future. It is recommended to pay close attention to the newest models available from these three families, while analyzing the lifecycle of the current offerings.

<div style="border:solid green 4px; padding: 20px">Outstanding work in terms of visualization techniques and meticulous mathematical inference. Not much to criticize, to be honest. Slight reminder, also practice using plotting libraries other than plotly. The reason is - when you are facing some real big datasets, graphs with non-aggregated objects (like scatter plots) seem to require an unaffordable amount of RAM, if processed in that interactive manner plotly uses.<hr>Keep up the good work, take care, see you!</div>

# Summary
Here’s what project reviewers will be looking at when evaluating your project:
- [x] How do you describe the problems you identify in the data?
- [x] How do you prepare a dataset for analysis?
- [x] How do you build distribution graphs and how do you explain them?
- [x] How do you calculate standard deviation and variance?
- [x] Do you formulate alternative and null hypotheses?
- [x] What methods do you apply when testing them?
- [x] Do you explain the results of your hypothesis tests?
- [x] Do you follow the project structure and keep your code neat and comprehensible?
- [x] Which conclusions do you reach?
- [x] Did you leave clear, relevant comments at each step?

Thank you for reviewing my project. 