# 2017 Game Sales Outlook <a id='back'></a> 

* [Introduction](#intro)
* [1 Data Overview and Preprocessing](#data_over)
    * [1.1 Imports and Load Data](#load)
    * [1.2 Data Preprocessing](#data_pre)
    * [1.8 Aggregate Data](#aggregation)
    * [1.9 Preliminary Conclusion](#conc_prelim)
* [2 Data Analysis](#analysis)
    * [1 Select Forecast Data](#forecast_data)
    * [2 Most Popular Platform](#platform_analysis)
    * [2.1 Critic Score vs User Score](#score_analysis)
    * [2 Genres](#genre_analysis)
    * [3 Regional User](#region_analysis)
* [4 Test Statistical Hypotheses](#test_hyp)
    * [4.1 First Hypothesis](#hyp_1)
    * [4.2 Second Hypothesis](#hyp_2)
* [5 Final Conclusion](#fin_conc)

# Introduction <a id='intro'></a>
    
lorem ipsum croque monsieur asta la vista baby

test the following hypotheses:
* Average user ratings of the Xbox One and PC platforms are the same.
* Average user ratings for the Action and Sports genres are different.

* [back](#back)
<a id='forecast_data'></a>

## Data Overview and Preprocessing <a id='data_over'></a>

In [285]:
# import pandas, a general data-management library
import pandas as pd

# import pyplot, a graph plotting library
from matplotlib import pyplot as plt 

# import scipy, a statistical analysis library
from scipy import stats as st

# import plotly.express, a high level plotting library
import plotly.express as px

In [2]:
# load the data into a pandas dataframe
df = pd.read_csv('games.csv')

In [3]:
#print general information about the dataset
df.info()
display(df.describe())
display(df.sample(n=10, random_state=0))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16715 entries, 0 to 16714
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Name             16713 non-null  object 
 1   Platform         16715 non-null  object 
 2   Year_of_Release  16446 non-null  float64
 3   Genre            16713 non-null  object 
 4   NA_sales         16715 non-null  float64
 5   EU_sales         16715 non-null  float64
 6   JP_sales         16715 non-null  float64
 7   Other_sales      16715 non-null  float64
 8   Critic_Score     8137 non-null   float64
 9   User_Score       10014 non-null  object 
 10  Rating           9949 non-null   object 
dtypes: float64(6), object(5)
memory usage: 1.4+ MB


Unnamed: 0,Year_of_Release,NA_sales,EU_sales,JP_sales,Other_sales,Critic_Score
count,16446.0,16715.0,16715.0,16715.0,16715.0,8137.0
mean,2006.484616,0.263377,0.14506,0.077617,0.047342,68.967679
std,5.87705,0.813604,0.503339,0.308853,0.186731,13.938165
min,1980.0,0.0,0.0,0.0,0.0,13.0
25%,2003.0,0.0,0.0,0.0,0.0,60.0
50%,2007.0,0.08,0.02,0.0,0.01,71.0
75%,2010.0,0.24,0.11,0.04,0.03,79.0
max,2016.0,41.36,28.96,10.22,10.57,98.0


Unnamed: 0,Name,Platform,Year_of_Release,Genre,NA_sales,EU_sales,JP_sales,Other_sales,Critic_Score,User_Score,Rating
7634,Press Your Luck 2010 Edition,DS,2009.0,Misc,0.18,0.0,0.0,0.01,,tbd,E
13771,Aeon Flux,PS2,2005.0,Action,0.02,0.02,0.0,0.01,66.0,5.8,T
3051,Castlevania: Lords of Shadow,X360,2010.0,Action,0.42,0.17,0.01,0.05,83.0,7.8,M
15726,Prince of Stride,PSV,2015.0,Adventure,0.0,0.0,0.02,0.0,,,
578,Final Fantasy XIII-2,PS3,2011.0,Role-Playing,0.78,0.73,0.89,0.23,79.0,6.6,T
14668,World of Zoo,PC,2009.0,Simulation,0.0,0.02,0.0,0.01,,8.4,E
10421,Gravity Games Bike: Street Vert Dirt,PS2,2002.0,Sports,0.05,0.04,0.0,0.01,24.0,4.1,T
10231,Calling,Wii,2009.0,Adventure,0.06,0.04,0.0,0.01,49.0,6.7,T
12163,Titanic Mystery,DS,2010.0,Puzzle,0.05,0.01,0.0,0.01,,tbd,T
1090,PGR: Project Gotham Racing 2,XB,2003.0,Racing,0.97,0.59,0.04,0.07,,,


## Data Overview Summary <a id='data_over'></a>

We can see we have a dataset with 16715 rows and the following columns:
* Name
* Platform
* Year_of_Release
* Genre
* NA_sales (North American sales in USD million)
* EU_sales (sales in Europe in USD million)
* JP_sales (sales in Japan in USD million)
* Other_sales (sales in other countries in USD million)
* Critic_Score (maximum of 100)
* User_Score (maximum of 10)
* Rating (ESRB)



Further, there are a few issues with our data which will require preprocessing before we can perform our analysis.

These issues amount to:
* column names in improper case
* missing values
* values in a numeric-type data field which are of an indeterminate value. To be specific, these are the 'tbd' values in our Critic_Score and User_Score columns.

Further, we will create one new column using data from those existing to aid our analysis.

[Back to Contents](#back)

# Data Preprocessing <a id='data_pre'></a>

First, we will handle the simple task of reassigning our column names to proper snake_case.

In [4]:
# rename columns to proper snake_case
df.columns= df.columns.str.lower()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16715 entries, 0 to 16714
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   name             16713 non-null  object 
 1   platform         16715 non-null  object 
 2   year_of_release  16446 non-null  float64
 3   genre            16713 non-null  object 
 4   na_sales         16715 non-null  float64
 5   eu_sales         16715 non-null  float64
 6   jp_sales         16715 non-null  float64
 7   other_sales      16715 non-null  float64
 8   critic_score     8137 non-null   float64
 9   user_score       10014 non-null  object 
 10  rating           9949 non-null   object 
dtypes: float64(6), object(5)
memory usage: 1.4+ MB


Next, we will process the missing values.

We will address this by-column as-follows:
* name  - drop these entries if it will not affect our analysis, otherwisefill with 'noname' as the name of the game is not pertinent to our analysis.
* year_of_release - drop all entries with null values as the game's release year is necessary for our forecast and filling these values could alter the results of the analysis.
* critic_score  - Leave these as NA since filling them will alter minor elements of our analysis
* user_score    - Leave these as NA since filling them will alter minor elements of our analysis
* rating        - Leave these as NA since filling them will alter minor elements of our analysis

In [5]:
# For curiosity's sake we will take a look at the two n/a names column values
display(df[(df['name'].isnull())])

Unnamed: 0,name,platform,year_of_release,genre,na_sales,eu_sales,jp_sales,other_sales,critic_score,user_score,rating
659,,GEN,1993.0,,1.78,0.53,0.0,0.08,,,
14244,,GEN,1993.0,,0.0,0.0,0.03,0.0,,,


Since these entries are for games that are fifteen years old, were released for a console which is no longer supported(Sega Genesis), and are missing their genre data(one of our variables of interest), we can drop these entries without altering the results of our analysis or forecast.

In [6]:
# drop entries with null values in the name column
df.dropna(subset='name', inplace=True)

#drop entries with null values in the year_of_release column
df.dropna(subset='year_of_release', inplace=True)

#verify total entries to non-null entries in dropped columns
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16444 entries, 0 to 16714
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   name             16444 non-null  object 
 1   platform         16444 non-null  object 
 2   year_of_release  16444 non-null  float64
 3   genre            16444 non-null  object 
 4   na_sales         16444 non-null  float64
 5   eu_sales         16444 non-null  float64
 6   jp_sales         16444 non-null  float64
 7   other_sales      16444 non-null  float64
 8   critic_score     7983 non-null   float64
 9   user_score       9839 non-null   object 
 10  rating           9768 non-null   object 
dtypes: float64(6), object(5)
memory usage: 1.5+ MB


We can see our total number of entries, 16444, matches our number of entries for our primary columns-of-interest.

Now we will proceed with setting our columns to the proper datatype.

In [7]:
# convert data to proper filetypes
# using this method of conversion will throw an error if there are decimal values
df['year_of_release'] = df['year_of_release'].astype(int)

# convert the critic_score column to float since pandas .corr method will not work with the integer datatype
df['critic_score'] = df['critic_score'].astype(float)

# convert the user score column to float since it has decimal values
df['user_score'] = pd.to_numeric(df['user_score'], errors='coerce')

df.info()
display(df.sample(n=10, random_state=0))

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16444 entries, 0 to 16714
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   name             16444 non-null  object 
 1   platform         16444 non-null  object 
 2   year_of_release  16444 non-null  Int64  
 3   genre            16444 non-null  object 
 4   na_sales         16444 non-null  float64
 5   eu_sales         16444 non-null  float64
 6   jp_sales         16444 non-null  float64
 7   other_sales      16444 non-null  float64
 8   critic_score     7983 non-null   Int64  
 9   user_score       7463 non-null   float64
 10  rating           9768 non-null   object 
dtypes: Int64(2), float64(5), object(4)
memory usage: 1.5+ MB


Unnamed: 0,name,platform,year_of_release,genre,na_sales,eu_sales,jp_sales,other_sales,critic_score,user_score,rating
12857,MotoGP 4 - Official Game of MotoGP,PS2,2005,Racing,0.03,0.02,0.0,0.01,,,
14457,Stacked with Daniel Negreanu,XB,2006,Misc,0.02,0.01,0.0,0.0,61.0,,T
6491,Riding Spirits,PS2,2002,Racing,0.13,0.1,0.0,0.03,59.0,9.0,E
6134,Kung Fu Chaos,XB,2003,Fighting,0.21,0.06,0.0,0.01,68.0,8.5,T
1185,Mario Party 7,GC,2005,Misc,0.95,0.11,0.46,0.04,64.0,7.9,E
11096,Pachitte Chonmage Tatsujin 10: Pachinko Fuyu n...,PS2,2007,Misc,0.0,0.0,0.09,0.0,,,
12447,2 Games in 1: Sonic Pinball Party & Columns Crown,GBA,2005,Misc,0.04,0.02,0.0,0.0,,,
9976,Sengoku Cyber: Fujimaru Jigokuhen,PS,1995,Strategy,0.0,0.0,0.11,0.01,,,
5463,NASCAR Thunder 2002,XB,2001,Racing,0.25,0.07,0.0,0.01,82.0,,E
13905,Dora's Big Birthday Adventure,PS2,2010,Misc,0.02,0.01,0.0,0.0,,,E


Finally, we will create a new column 'global_sales' to aid our analysis later.

In [8]:
# create 'global_sales' column as a sum of the other *_sales columns
df['global_sales'] = df['na_sales'] + df['eu_sales'] + df['jp_sales'] + df['other_sales']

df.info()
display(df.sample(n=10, random_state=0))

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16444 entries, 0 to 16714
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   name             16444 non-null  object 
 1   platform         16444 non-null  object 
 2   year_of_release  16444 non-null  Int64  
 3   genre            16444 non-null  object 
 4   na_sales         16444 non-null  float64
 5   eu_sales         16444 non-null  float64
 6   jp_sales         16444 non-null  float64
 7   other_sales      16444 non-null  float64
 8   critic_score     7983 non-null   Int64  
 9   user_score       7463 non-null   float64
 10  rating           9768 non-null   object 
 11  global_sales     16444 non-null  float64
dtypes: Int64(2), float64(6), object(4)
memory usage: 1.7+ MB


Unnamed: 0,name,platform,year_of_release,genre,na_sales,eu_sales,jp_sales,other_sales,critic_score,user_score,rating,global_sales
12857,MotoGP 4 - Official Game of MotoGP,PS2,2005,Racing,0.03,0.02,0.0,0.01,,,,0.06
14457,Stacked with Daniel Negreanu,XB,2006,Misc,0.02,0.01,0.0,0.0,61.0,,T,0.03
6491,Riding Spirits,PS2,2002,Racing,0.13,0.1,0.0,0.03,59.0,9.0,E,0.26
6134,Kung Fu Chaos,XB,2003,Fighting,0.21,0.06,0.0,0.01,68.0,8.5,T,0.28
1185,Mario Party 7,GC,2005,Misc,0.95,0.11,0.46,0.04,64.0,7.9,E,1.56
11096,Pachitte Chonmage Tatsujin 10: Pachinko Fuyu n...,PS2,2007,Misc,0.0,0.0,0.09,0.0,,,,0.09
12447,2 Games in 1: Sonic Pinball Party & Columns Crown,GBA,2005,Misc,0.04,0.02,0.0,0.0,,,,0.06
9976,Sengoku Cyber: Fujimaru Jigokuhen,PS,1995,Strategy,0.0,0.0,0.11,0.01,,,,0.12
5463,NASCAR Thunder 2002,XB,2001,Racing,0.25,0.07,0.0,0.01,82.0,,E,0.33
13905,Dora's Big Birthday Adventure,PS2,2010,Misc,0.02,0.01,0.0,0.0,,,E,0.03


## Preprocessing Conclusion <a id='pre_conc'></a>

Now our preprocessing is complete.

We have prepared the dataset for analysis and accounted for the missing data.

[Back to Contents](#back)

# Data Analysis  <a id='analysis'></a>


* compare total sales by-console for the entire forecast year-range
* compare the total sales by-console split further by region
* Compare the relationship between sales, User reviews, and Professional reviews for the most popular platform
* Compare the relationship between sales, User reviews, and Professional reviews for other platforms
* Analyze the distribution of games by genre


* reduce our dataset to those years
* build user profiles for each region(NA, EU, JP)
* userprofile platform releases, revenue, esrb

## Select Forecast Data
<a id='forecast_data'></a>
Next, we will investigate which year-range is most relevant to our prognosis for 2017.

To do this, we will look at the distribution of total global game sales by-year and game release-years per-console.

In [9]:
# plot a histogram of the total global sales over the by-year per-platform for the entire dataset
plt = px.histogram(df, 'year_of_release', 'global_sales', color='platform')
plt.show()

plt = px.box(df, 'platform', 'year_of_release')
plt.show()

Let's investigate our game release years a little further.

In [10]:
# create a temporary dataframe with the first and last years of a game release, and difference between those
df_tmp = df.groupby(['platform'])['year_of_release'].min().reset_index()
df_tmp.rename({'year_of_release':'first_year'}, axis=1, inplace=True)
df_tmp = df_tmp.merge(df.groupby(['platform'])['year_of_release'].max().reset_index(), how='outer', on='platform')
df_tmp.rename({'year_of_release':'last_year'}, axis=1, inplace=True)
df_tmp['years_on_market'] = df_tmp['last_year'] - df_tmp['first_year']

# then make a simple plot of our aggregated total years of games released to the market
plt= px.bar(df_tmp, 'platform', 'years_on_market')
plt.show()

We see most consoles have a new-games-life of 5 to 10 years with two major outliers, PC which is not a holistic platform and Gameboy DS.

Further, from our boxplot we can see new consoles are released roughly within the same range.

In [11]:
# Create a plot similar to that above, reduced to years from 2005 to 2016
plt = px.histogram(df[(df['year_of_release'] >= 2005)], 'year_of_release', 'global_sales', color='platform')
plt.show()

We can see the latest major change in game sales by-year per-platform occurs in 2013 when two next generation consoles were released, the PS4 and Xbox One. 

In [12]:
# create a new dataset from our original dataset filtered for years from 2013 to 2016(current)
df_from_2013 = df[(df['year_of_release'] >= 2013)]

# Create a plot similar to that above, reduced to years from 2013 to 2016
plt = px.histogram(df_from_2013, 'year_of_release', 'global_sales', color='platform')
plt.show()

We can now clearly see the pattern of diminishing sales for the two legacy platforms which previously held the greatest market share, the PS3 and Xbox 360.

Further, we see the market share for the PS4 and Xbox One are growing.

Since:
* 2013 is the beginning of this trend
* console game-release year-ranges for consoles are 5 to 10 years

We will use the years 2013 to 2016 for our forecast.

Now we will perform the following analytic steps on our forecast dataset:
* compare total sales by-console for the entire forecast year-range
* compare the total sales by-console split further by region
* Compare the relationship between sales, User reviews, and Professional reviews for the most popular platform
* Compare the relationship between sales, User reviews, and Professional reviews for other platforms
* Analyze the distribution of games by genre

## Platform Popularity <a id='platform_analysis'></a> 

In [13]:
# bar chart per-platform comparing total sales
plt = px.bar(df_from_2013, x='platform',
             y='global_sales',
             title='Total Sales per Platform from 2013 to 2016',
             labels={'platform':'Platform','global_sales':'Global Sales'})
plt.show()

Playstation 4 has the most sales in our forecast range by a wide margin.

In [14]:
# bar chart per-platform comparing sales for each region
plt = px.bar(df_from_2013, x='platform',
             y=['na_sales', 'eu_sales', 'jp_sales', 'other_sales'],
             barmode='group',
             title='Total Sales per Platform from 2013 to 2016',
             labels={'platform':'Platform','value':'Regional Sales', 'variable':'Sales Region', 'na_sales':'North America Sales', 'eu_sales':'Europe Sales','jp_sales':'Japan Sales','other_sales':'Other Sales'})
plt.show()

We now see the bulk of Playstation 4 sales during the forecast period were in the US and EU markets.

[Back to Contents](#back)

## Critic Scores vs User Scores <a id='score_analysis'></a>

In [15]:
# create a scatterplot of PS4 game sales per user and critic reviews
plt = px.scatter(df_from_2013[(df_from_2013['platform']=='PS4')], 'critic_score', 'global_sales', opacity=0.5)
plt.show()
plt = px.scatter(df_from_2013[(df_from_2013['platform']=='PS4')], 'user_score', 'global_sales', opacity=0.5)
plt.show()

In [53]:
# print number of non-null values in the critic_score column(sample size), filtering for the ps4 platform
print('For', df_from_2013[df_from_2013['critic_score'].notna() & (df_from_2013['platform']=='PS4')]['critic_score'].count(), 'non-null Critic Score samples:')
#print the pearson coefficient between the critic_score and global_sales columns, filtering for the ps4 platform
print('The Pearson correlation coefficient between Critic Score and global sales is', df_from_2013[(df_from_2013['platform']=='PS4')]['critic_score'].corr(df_from_2013['global_sales']))

# print number of non-null values in the user_score column(sample size), filtering for the ps4 platform
print('For', df_from_2013[(df_from_2013['user_score'].notna() & (df_from_2013['platform']=='PS4'))]['user_score'].count(), 'non-null User Score samples:')
#print the pearson coefficient between the user_score and global_sales columns, filtering for the ps4 platform
print('The Pearson correlation coefficient between User Score and global sales is', df_from_2013[(df_from_2013['platform']=='PS4')]['user_score'].corr(df_from_2013['global_sales']))

For 252 non-null Critic Score samples:
The Pearson correlation coefficient between Critic Score and global sales is 0.40656790206178095
For 257 non-null User Score samples:
The Pearson correlation coefficient between User Score and global sales is -0.031957110204556424


Based on a glance at our graphs and these results, we can make the following observations:
* There is a medium correlation between Critic Scores and Global sales for PS4 games sold between 2013 and 2016.
* There is a very small negative correlation between User Scores and Global sales for PS4 games sold between 2013 and 2016.

[Back to Contents](#back)

In [89]:
# Create a scatterplot of all platform game sales per user and critic reviews
plt = px.scatter(df_from_2013, 'critic_score', 'global_sales', color='platform')
plt.show()
plt = px.scatter(df_from_2013, 'user_score', 'global_sales',color='platform')
plt.show()

In [49]:
# print number of non-null values in the critic_score column(sample size)
print('For', df_from_2013[df_from_2013['critic_score'].notna()]['critic_score'].count(), 'non-null Critic Score samples:')
#print the pearson coefficient between the critic_score and global_sales columns
print('The Pearson correlation coefficient between Critic Score and global sales is', df_from_2013['critic_score'].corr(df_from_2013['global_sales']))

# print number of non-null values in the user_score column(sample size)
print('For', df_from_2013[df_from_2013['user_score'].notna()]['user_score'].count(), 'non-null User Score samples:')
#print the pearson coefficient between the user_score and global_sales columns
print('The Pearson correlation coefficient between User Score and global sales is', df_from_2013['user_score'].corr(df_from_2013['global_sales']))

For 991 non-null Critic Score samples:
The Pearson correlation coefficient between Critic Score and global sales is 0.3136995151027368
For 1192 non-null User Score samples:
The Pearson correlation coefficient between User Score and global sales is -0.0026078133545982705


Based on a glance at our graphs and these results, we can make the following observations:
* There is a medium-small correlation between Critic Scores and Global sales for all console games sold between 2013 and 2016.
* There is a very small negative correlation between User Scores and Global sales for all console games sold between 2013 and 2016.

[Back to Contents](#back)

## Distribution of Genres in Games <a id='genre_analysis'></a>

In [182]:
# Create a new dataframe with a summation of the global sales and count of games per genre
# this will aid creating our clustered bar graph
df_from_2013_genre_counts = df_from_2013.groupby('genre')['global_sales'].sum().reset_index()
df_from_2013_genre_counts = df_from_2013_genre_counts.merge(df_from_2013.groupby('genre')['name'].count().reset_index())
df_from_2013_genre_counts.rename({'name':'count'}, axis=1, inplace=True)
df_from_2013_genre_counts['revenue_per_game'] = df_from_2013_genre_counts['global_sales'] / df_from_2013_genre_counts['count']
df_from_2013_genre_counts.sort_values('count')

Unnamed: 0,genre,global_sales,count,revenue_per_game
5,Puzzle,3.17,17,0.186471
11,Strategy,10.08,56,0.18
9,Simulation,21.76,62,0.350968
4,Platform,42.63,74,0.576081
2,Fighting,35.31,80,0.441375
6,Racing,39.89,85,0.469294
3,Misc,62.82,155,0.40529
8,Shooter,232.98,187,1.245882
10,Sports,150.65,214,0.703972
1,Adventure,23.64,245,0.09649


In [203]:
# Create a grouped bar plot counting the number of games by genre and the global sales of each genre
plt = px.bar(df_from_2013_genre_counts,
             x='genre',
             y=['count', 'global_sales'],
             barmode='group',
             )
plt.update_layout(xaxis={'categoryorder': 'max descending'})
plt.show()

Based on this visual data we can make the following assumptions:
* Shooter games outperform all other genres, generating more than $1 million in gross revenue per game
* Action games are the most-numerous game released to the market between 2013 and 2016
* Action games as a whole also generate more gross revenue than Shooter games
* Adventure, Strategy, and Puzzle games generate the least amount of revenue per-game, despite the number of Adventure games brought to market

[Back to Contents](#back)

## Regional User <a id='region_analysis'></a>

Now we will analyze our data by regional sales.

In [239]:
print('North America game sales basic stats:')
print(df_from_2013[(df_from_2013['na_sales'] != 0)]['na_sales'].describe(),'\n')

North America game sales basic stats:
count    1309.000000
mean        0.334385
std         0.682376
min         0.010000
25%         0.040000
50%         0.110000
75%         0.340000
max         9.660000
Name: na_sales, dtype: float64 



In [232]:
print('Europe game sales basic stats:')
print(df_from_2013[(df_from_2013['na_sales'] != 0)]['eu_sales'].describe(),'\n')

Europe game sales basic stats:
count    1309.000000
mean        0.289259
std         0.644632
min         0.000000
25%         0.020000
50%         0.100000
75%         0.280000
max         9.090000
Name: eu_sales, dtype: float64 



In [231]:
print('Japan game sales basic stats:')
print(df_from_2013[(df_from_2013['na_sales'] != 0)]['jp_sales'].describe(),'\n')

Japan game sales basic stats:
count    1309.000000
mean        0.062009
std         0.259217
min         0.000000
25%         0.000000
50%         0.000000
75%         0.040000
max         4.350000
Name: jp_sales, dtype: float64 



In [298]:
plt = px.histogram(df_from_2013, x='platform', y=['na_sales', 'eu_sales', 'jp_sales'], barmode='group')
plt.update_layout(xaxis={'categoryorder': 'total descending'})
plt.show()

In [237]:
plt = px.histogram(df_from_2013, x='genre', y=['na_sales', 'eu_sales', 'jp_sales'], barmode='group')
plt.update_layout(xaxis={'categoryorder': 'total descending'})
plt.show()

In [243]:
plt = px.histogram(df_from_2013, x='rating',  y=['na_sales', 'eu_sales', 'jp_sales'], barmode='group',category_orders={'rating':['E','E10+', 'T', 'M']})
plt.show()

## Games by-region analysis conclusion

### General Analysis Conclusion

[Back to Contents](#back)

# Test Hypotheses <a id='test_hyp'></a>

## First Hypothesis
<a id='hyp_1'></a>

Next we will test our First Hypothesis:
* Average user ratings of the Xbox One and PC platforms are the same.

To test this, we will use the following null hypothesis:
<br>Average user ratings of the Xbox One and PC platforms <u>**do not**</u> differ.

And we will adopt the following alternative Hypothesis:
<br>Average user ratings of the Xbox One and PC platforms differ.

Where a failure to reject our null hypothesis affirms our First Hypothesis.

We choose an alpha value of 0.05 or a 95% significance level which is the lowest level that is commonly accepted.

In [306]:
# Test the hypotheses
# test at a 95% confidence level
alpha = 0.01
conf_percent = int((1-alpha) * 100)
# perform a t-test on all non-null values for the game user scores of the respective platforms to be tested
results = st.ttest_ind(df_from_2013[((df_from_2013['platform']=='XOne') & (df_from_2013['user_score'].notnull()))]['user_score'],
                        df_from_2013[((df_from_2013['platform']=='PC') & (df_from_2013['user_score'].notnull()))]['user_score'],
                        equal_var=False)

#print n for our samples tested
print('For:')
print(df_from_2013[((df_from_2013['platform']=='XOne') & (df_from_2013['user_score'].notnull()))]['user_score'].count(),'Xbox One Game user score samples and')
print(df_from_2013[((df_from_2013['platform']=='PC') & (df_from_2013['user_score'].notnull()))]['user_score'].count(),'PC Game user score samples')

#print our resulting p-value
print('p-value: ', results.pvalue)

#print our results with respect to the null hypothesis and chosen alpha/significance-level
if results.pvalue < alpha:
    print('We reject the null at with a ',conf_percent,'% significance level.', sep='')
else:
    print('We can\'t reject the null hypothesis at a ',conf_percent,'% sifnificance level.', sep='')

For:
182 Xbox One Game user score samples and
155 PC Game user score samples
p-value:  0.14759594013430463
We can't reject the null hypothesis at a 99% sifnificance level.


Since:
* our null hypothesis is a statistically-sound paraphrase of our First Hypothesis
* our sample size is large(n > 50)
* our chosen statistical significance level is 99%
* Given these conditions, we could not reject the null hypothesis

We can confidently assert:<br>Average user ratings of the Xbox One and PC platforms are the same.

## Second Hypothesis
<a id='hyp_2'></a>

Now we will test our second hypothesis: 
* Average user ratings for the Action and Sports genres differ.



To test this, we will use the following null hypothesis:
<br>The average user ratings for the Action and Sports genres  <u>**do not**</u> differ.

And the following alternative hypothesis:
<br>The The average user ratings for the Action and Sports genres differ.

Where rejecting the null hypothesis and adopting the alternative hypothesis affirms our Second Hypothesis.

Based on our previous analyses, we will choose an alpha value of 0.01 or, in other words, a significance level of 99%.

In [305]:
# Test the hypotheses
# test at a 99% confidence level
alpha = 0.01
conf_percent = int((1-alpha) * 100)
# perform a t-test on all non-null values for the game user scores of the respective genres to be tested
results = st.ttest_ind(df_from_2013[((df_from_2013['genre']=='Action') & (df_from_2013['user_score'].notnull()))]['user_score'],
                        df_from_2013[((df_from_2013['genre']=='Sports')& (df_from_2013['user_score'].notnull()))]['user_score'],
                        equal_var=False)

#print n for our samples tested
print('For:')
print(df_from_2013[(df_from_2013['genre']=='Action')]['user_score'].notnull().count(),'Action Game user score samples and')
print(df_from_2013[(df_from_2013['genre']=='Sports')]['user_score'].notnull().count(),'Sports Game user score samples')

#print our resulting p-value
print('p-value: ', results.pvalue)

#print our results with respect to the null hypothesis and chosen alpha/significance-level
if results.pvalue < alpha:
    print('We reject the null hypothesis at a ',conf_percent,'% significance level.', sep='')
else:
    print('We can\'t reject the null hypothesis at a ',conf_percent,'% sifnificance level.', sep='')

For:
766 Action Game user score samples and
214 Sports Game user score samples
p-value:  1.4460039700704315e-20
We reject the null hypothesis at a 99% significance level.



Since:
* our alternative hypothesis is a statistically-sound paraphrase of our First Hypothesis
* our sample size is large(n > 50)
* our chosen statistical significance level is 99%
* We rejected the null hypothesis and adopt the alternative hypothesis

We can confidently assert:<br>Average user ratings for the Action and Sports genres differ.

# Final Conclusion <a id='fin_conc'></a>



[Back to Contents](#back)