# Video Game Sales Data Analysis 
---
This Jupyter notebook will explore a dataset of international video game sales in order to identify patterns that determine whether a game succeeds or not.


#### Project Sections 
1. Initial Set Up and Data Preparation 
2. General Data Analysis 
3. Analysis By Sales Region 
4. Hypothesis Testing 

### Initial Set Up & Data Preparation 
---

In [1]:
# Import required libraries 
from scipy import stats as st
import numpy as np
import pandas as pd
import plotly.express as px

In [2]:
try:
    # Attempt to read the data
    df = pd.read_csv('games.csv')
    # If successful, print Confirmation
    print("The data has been read in as df.")
except Exception as e:
    # If an error occurs, print an error message
    print("Error reading data:", e)
    print("To get all project files please visit https://github.com/le-crupi64/Video-Game-Sales-Analysis")

The data has been read in as df.


#### Data Cleaning 
Part 1: Finding Issues 

In [3]:
# Sample Data 
df.sample(3)

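| | Name | Platform | Year_of_Release | Genre | NA_sales | EU_sales | JP_sales | Other_sales | Critic_Score | User_Score | Rating |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 15487 | Darksiders: Warmastered Edition | PS4 | 2016.0 | Action | 0.02 | 0.0 | 0.0 | 0.0 | 79.0 | 8.5 | M |
| 5088 | Reel Fishing III | PS2 | 2003.0 | Sports | 0.18 | 0.14 | 0.0 | 0.05 | NaN | 7.8 | E |
| 8834 | The Oregon Trail | Wii | 2011.0 | Simulation | 0.14 | 0.0 | 0.0 | 0.01 | NaN | NaN | NaN |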


In [4]:
# Check data types and Missing values 
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16715 entries, 0 to 16714
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Name             16713 non-null  object 
 1   Platform         16715 non-null  object 
 2   Year_of_Release  16446 non-null  float64
 3   Genre            16713 non-null  object 
 4   NA_sales         16715 non-null  float64
 5   EU_sales         16715 non-null  float64
 6   JP_sales         16715 non-null  float64
 7   Other_sales      16715 non-null  float64
 8   Critic_Score     8137 non-null   float64
 9   User_Score       10014 non-null  object 
 10  Rating           9949 non-null   object 
dtypes: float64(6), object(5)
memory usage: 1.4+ MB


In [5]:
# Check for duplicate rows 
df.duplicated().sum()

0

In [6]:
# Check for implicit duplicates
df['Platform'].unique()

array(['Wii', 'NES', 'GB', 'DS', 'X360', 'PS3', 'PS2', 'SNES', 'GBA',
       'PS4', '3DS', 'N64', 'PS', 'XB', 'PC', '2600', 'PSP', 'XOne',
       'WiiU', 'GC', 'GEN', 'DC', 'PSV', 'SAT', 'SCD', 'WS', 'NG', 'TG16',
       '3DO', 'GG', 'PCFX'], dtype=object)

In [7]:
df['Genre'].unique()

array(['Sports', 'Platform', 'Racing', 'Role-Playing', 'Puzzle', 'Misc',
       'Shooter', 'Simulation', 'Action', 'Fighting', 'Adventure',
       'Strategy', nan], dtype=object)

In [8]:
df['Rating'].unique()

array(['E', nan, 'M', 'T', 'E10+', 'K-A', 'AO', 'EC', 'RP'], dtype=object)

Part 2: Fixing Issues  
Upon initial exploration of the data, it has been determined that the following tasks must be carried out in order to prepare the data for analysis: 
- Column names should be made lowercase 
- Columns with object data types should have strings in lowercase
- user_score should be a float data type
- year_of_release should be an int or object
- Replace missing values: 
    - rating: Replace null with 'unrated'
    - critic_score and user_score: Replace null and 'tbd' with median of genre 
    - genre: replace null with 'misc'
    - name: drop null values (this is 0.01% of data)
    - year_of_release: drop rows with null values (this is 1.6% of data)
- Add a column with total sales 
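
The score-imputation step in the list above can be sketched on a toy table. This is a minimal illustration, assuming made-up genres and scores (the `toy` DataFrame is hypothetical and not taken from `games.csv`). It shows how `'tbd'` turns into a missing value and how each gap is then filled with the median of its own genre.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only
toy = pd.DataFrame({
    "genre": ["action", "action", "action", "sports", "sports"],
    "user_score": ["8.0", "tbd", "6.0", None, "7.5"],
})

# Coerce non-numeric markers such as 'tbd' to NaN
toy["user_score"] = pd.to_numeric(toy["user_score"], errors="coerce")

# Median of the observed scores in each genre, aligned to the original rows
genre_median = toy.groupby("genre")["user_score"].transform("median")

# Fill only the missing entries with the median of their genre
toy["user_score"] = toy["user_score"].fillna(genre_median)

print(toy)
```

Using `transform` rather than `agg` keeps the result the same length as the original column, which is what allows it to be passed directly to `fillna`.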


In [9]:
# Make column names lowercase
df.columns = df.columns.str.lower()

# Make the contents of object datatype columns lowercase 
df['name'] = df['name'].str.lower()
df['platform'] = df['platform'].str.lower()
df['genre'] = df['genre'].str.lower()
df['rating'] = df['rating'].str.lower()

# Verify Changes
df.sample(1)

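| | name | platform | year_of_release | genre | na_sales | eu_sales | jp_sales | other_sales | critic_score | user_score | rating |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5236 | south park rally | ps | 1998.0 | racing | 0.2 | 0.13 | 0.0 | 0.02 | NaN | NaN | NaN |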


In [10]:
# Make user_score a float data type; errors='coerce' turns 'tbd' into NaN
df['user_score'] = pd.to_numeric(df['user_score'], errors='coerce')

# Make year_of_release a nullable integer data type
df['year_of_release'] = df['year_of_release'].astype('Int64')

# Replace null values in the rating column with 'unrated'
df['rating'] = df['rating'].fillna('unrated')

# Replace null values in the genre column with 'misc'
df['genre'] = df['genre'].fillna('misc')

# Replace null values in the critic_score column with the median for the genre
genre_median = df.groupby('genre')['critic_score'].transform('median')
df['critic_score'] = df['critic_score'].fillna(genre_median)

# Replace null values in the user_score column with the median for the genre
genre_median = df.groupby('genre')['user_score'].transform('median')
df['user_score'] = df['user_score'].fillna(genre_median)

# Drop rows with null values in the name or year_of_release columns
df = df.dropna(subset=['name', 'year_of_release'])

In [11]:
# Add a column with total sales 
df['total_sales'] = df['eu_sales'] + df['jp_sales'] + df['na_sales'] + df['other_sales']

In [12]:
# Verify Changes 
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16444 entries, 0 to 16714
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   name             16444 non-null  object 
 1   platform         16444 non-null  object 
 2   year_of_release  16444 non-null  Int64  
 3   genre            16444 non-null  object 
 4   na_sales         16444 non-null  float64
 5   eu_sales         16444 non-null  float64
 6   jp_sales         16444 non-null  float64
 7   other_sales      16444 non-null  float64
 8   critic_score     16444 non-null  float64
 9   user_score       16444 non-null  float64
 10  rating           16444 non-null  object 
 11  total_sales      16444 non-null  float64
dtypes: Int64(1), float64(7), object(4)
memory usage: 1.6+ MB


In [13]:
df.sample(3)

| | name | platform | year_of_release | genre | na_sales | eu_sales | jp_sales | other_sales | critic_score | user_score | rating | total_sales |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 9718 | motogp 3 - official game of motogp | ps2 | 2003 | racing | 0.06 | 0.05 | 0.0 | 0.02 | 69.0 | 7.4 | unrated | 0.13 |
| 9594 | dt racer | ps2 | 2005 | racing | 0.06 | 0.05 | 0.0 | 0.02 | 69.0 | 3.8 | e | 0.13 |
| 9237 | battlefield: hardline | pc | 2015 | shooter | 0.0 | 0.13 | 0.0 | 0.01 | 71.0 | 3.8 | m | 0.14 |


All previously observed issues have now been corrected. The data is now ready for analysis. 

### General Data Analysis
---
In this section, the entire dataset will be analyzed in order to answer some questions about global sales. The following questions will be answered: 
- Is the data for every period significant?
- What is the variance and standard deviation for total sales?
- How long does it generally take for new platforms to appear and old ones to fade?
- Which platforms are leading in sales? Which are growing and shrinking?
- Are the differences in global sales of games for each platform significant? What about average sales on various platforms? 
- Is there a correlation between user reviews and sales for the ps2, gb, and x360 platforms?
- How do the sales of Grand Theft Auto V on ps3 compare to sales of the same game on other platforms?
- Which genres are the most profitable? Which are the least?

#### Is the data for every period significant?
The volume of games released per year will be used to determine whether every period in the dataset is significant.

In [14]:
# Plot the total number of games released by year
fig1 = px.histogram(df,
                    x='year_of_release',
                    title='Figure 1: Number of Games Released by Year')
fig1.update_layout(xaxis_title='Release Year', yaxis_title='Number of Games Released')
fig1.show()


It appears that only a very small number of games were published annually from the beginning of the dataset in 1980 through 1993, with each year showing under 100 games published. Beginning in 1994, the number starts a strong and relatively consistent upward trend, from 121 games published in 1994 to a peak of 1,427 games published in 2008. From there, the number of games published annually trends downward, with the most significant drop being from 1,136 games published in 2011 to 653 games published in 2012. Based on these findings, it has been determined that the data period between 1980 and 1993 is not significant.
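
One way to make such a cutoff explicit is to count releases per year and keep only the years that reach a chosen threshold. The sketch below uses invented release years and an assumed threshold (`min_games`), so it illustrates the idea rather than reproducing the counts quoted above.

```python
import pandas as pd

# Hypothetical release years; the counts are invented for illustration only
years = pd.Series([1985, 1985, 1994, 1994, 1994, 1995, 1995, 1995, 1995])

# Assumed cutoff: a year counts as significant if it has at least this many games
min_games = 3

# Number of games released in each year, ordered chronologically
games_per_year = years.value_counts().sort_index()

# Years that meet the cutoff
significant_years = games_per_year[games_per_year >= min_games].index.tolist()

print(games_per_year)
print("Significant years:", significant_years)
```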

In [15]:
# Filter out the insignificant period
# engine='python' is set explicitly because numexpr does not support the Int64 extension dtype
df_filtered = df.query("year_of_release > 1993", engine='python')



#### What is the variance and standard deviation for total sales?

In [16]:
# Isolate total sales for the significant period 
total_sales = df_filtered['total_sales']
std = np.std(total_sales)
vr = np.var(total_sales)
mean = np.mean(total_sales)

print('The standard deviation of total sales is ', std)
print('The variance of total sales is ', vr)
print('The mean of total sales is ', mean)

The standard deviation of total sales is  1.4704278829517192
The variance of total sales is  2.1621581589618746
The mean of total sales is  0.5121052304247776


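The standard deviation (about 1.47) is nearly three times the mean (about 0.51), so total sales vary widely around the mean. Because sales cannot be negative, a spread this large relative to the mean points to a right-skewed distribution in which a small number of best-selling games pull the mean upward.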
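
A quick way to judge spread relative to the mean is the coefficient of variation (standard deviation divided by mean). The sketch below plugs in the values printed above. The small `sales` array is hypothetical and is only there to show that `np.std` uses the population formula (`ddof=0`) by default, while `ddof=1` gives the sample estimate.

```python
import numpy as np

# Values printed by the notebook cell above
std_total = 1.4704278829517192
mean_total = 0.5121052304247776

# Coefficient of variation: spread expressed as a multiple of the mean
cv_total = std_total / mean_total
print("Coefficient of variation:", round(cv_total, 2))

# Hypothetical sales figures (USD million), invented for illustration only
sales = np.array([0.1, 0.2, 0.2, 0.5, 4.0])

population_std = np.std(sales)       # ddof=0 is the NumPy default
sample_std = np.std(sales, ddof=1)   # divides by n - 1 instead of n

print("Population std:", population_std)
print("Sample std:", sample_std)
```

For a dataset with thousands of rows, the two versions of the standard deviation are nearly identical, so the choice matters mainly for small samples.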

#### How long does it generally take for new platforms to appear and old ones to fade?
To answer this question, sales variation from platform to platform will be analyzed to determine the platforms with the greatest total sales. Distributions will be built based on data from each year in order to analyze the rise and fall of these platforms. 

In [17]:
# Create a df with the sum of all sales grouped by year and platform 
yearly_platform_sales = df_filtered.groupby(['year_of_release','platform'])['total_sales'].sum().reset_index()

# Plot the total sales by year and platform 
fig2 = px.line(yearly_platform_sales, x="year_of_release", y="total_sales", color="platform", line_group="platform", hover_name="platform",
        line_shape="spline", render_mode="svg", title='Figure 2: Yearly Sales by Platform')
fig2.update_layout(xaxis_title='Year', yaxis_title='Total Sales (USD Million)')
fig2.show()

Moderately to highly successful platforms rise and fall within a span of approximately ten years, while less successful platforms may emerge and disappear in as little as one year. The only notable exception is PC, for which game sales span a period of over twenty-five years. This is likely because many people have access to computers and there are reasons to own one beyond gaming.
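
The lifespan of each platform can also be estimated numerically from the first and last release year on record. The following is a minimal sketch on a hypothetical `toy` table (the years are invented). It assumes that the span between the first and last recorded release is a reasonable proxy for how long a platform stayed active.

```python
import pandas as pd

# Hypothetical release records; the years are invented for illustration only
toy = pd.DataFrame({
    "platform": ["ps2", "ps2", "ps2", "pc", "pc", "gen"],
    "year_of_release": [2000, 2005, 2011, 1994, 2016, 1994],
})

# First and last year with at least one release, per platform
lifespan = (
    toy.groupby("platform")["year_of_release"]
    .agg(first_year="min", last_year="max")
)

# Count both endpoints, so a platform seen in a single year is active for 1 year
lifespan["years_active"] = lifespan["last_year"] - lifespan["first_year"] + 1

print(lifespan.sort_values("years_active", ascending=False))
```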

#### Which platforms are leading in sales? Which are growing and shrinking?

In [18]:
# Create a df with the sum of all sales grouped by platform 
platform_sales = df_filtered.groupby('platform')['total_sales'].sum().reset_index()

# Plot the total sales by platform
fig3 = px.histogram(platform_sales,
                    x='platform',
                    y='total_sales',
                    title='Figure 3: Total Sales by Platform',
                    color='platform')
fig3.update_layout(xaxis_title='Platform', yaxis_title='Total Sales (USD Million)')
fig3.show()

In Figure 3, it can be observed that the platforms with the all-time highest total game sales are ps2 ($1233.56M), x360 ($961.24M), ps3 ($931.34M), wii ($891.18M), and ds ($802.76M). In Figure 2, it can be observed that all platforms are trending downward, with three exceptions:
- Sales for the gen platform have an upward trend, but no data for this platform exists in the dataset past 1994
- Sales for the 3do platform have an upward trend, but no data for this platform exists in the dataset past 1995
- Sales for the gba platform have a slight upward trend, but no data for this platform exists in the dataset past 2007

There are no platforms with an upward trend in the most recent two years of the dataset. Considering the observed trend of platforms rising and falling in popularity within a span of about ten years, it can be assumed that the most popular gaming platform(s) of 2015-2025 are not included in this dataset.

#### Are the differences in global sales of games for each platform significant? What about average sales on various platforms? 

In [19]:
# Build a box plot of game sale variation by platform
fig4_1 = px.box(df_filtered, x="platform", y="total_sales", hover_name="name", title='Figure 4.1: Game Sale Variation By Platform with Outliers')
fig4_1.update_layout(xaxis_title='Game Platform', yaxis_title='Total Sales (USD Million)')
fig4_1.show()

In [20]:
# Build the same box plot with the y-axis limited to 0-4 so that most outliers fall outside the view
fig4_2 = px.box(df_filtered, x="platform", y="total_sales", hover_name="name", title='Figure 4.2: Game Sale Variation By Platform without Outliers')
fig4_2.update_layout(xaxis_title='Game Platform', yaxis_title='Total Sales (USD Million)', yaxis=dict(range=[0, 4]))
fig4_2.show()

The differences in global sales of games across platforms are somewhat pronounced. For the majority of platforms, the first three quartiles (including the median) are all below the half-million mark, while some more successful platforms have a third quartile around three-quarters of a million. The nes and gb platforms have the highest medians, at one million and 945k respectively. The gen platform has a third quartile of 1.73 million but a median of only 150k. Some platforms have notable outliers, showing that even a platform with a low median (below 200k in the case of wii) can still have a wildly successful game. The most notable outlier is Wii Sports on the wii platform, which grossed $82.54 million. The platforms with the most variation are wii, gb, ds, x360, ps3, ps2, and snes, in that order. Some of the record-breaking game franchises include:
- For wii: Wii Sports (multiple editions), Mario Kart, Wii Fit
- For gb: Pokemon (multiple editions), Super Mario Land
- For ds: Super Mario, Nintendogs, Mario Kart, Brain Age, Animal Crossing
- For x360: Kinect Adventures, Grand Theft Auto (multiple editions), Call of Duty (multiple editions), Halo
- For ps3: Grand Theft Auto (multiple editions), Call of Duty (multiple editions)
- For ps2: Grand Theft Auto (multiple editions), Gran Turismo
- For snes: Super Mario World (multiple editions)

It can also be observed that the highest-grossing game for each platform is often part of one of these successful, recognizable game franchises. The exceptions are Wii Sports and Wii Fit, which have a play style that cannot be adapted to other platforms.
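
Outliers like the ones discussed above can also be listed programmatically. The sketch below applies the common 1.5 × IQR rule to a hypothetical `toy` table with invented sales figures. It assumes pandas' default linear quantile interpolation, so the fences may differ slightly from the whiskers drawn by a plotting library.

```python
import pandas as pd

# Hypothetical games; names and sales are invented for illustration only
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e"],
    "total_sales": [0.1, 0.2, 0.3, 0.4, 10.0],
})

# Quartiles and interquartile range of total sales
q1 = toy["total_sales"].quantile(0.25)
q3 = toy["total_sales"].quantile(0.75)
iqr = q3 - q1

# Upper fence of the 1.5 * IQR rule; values above it are flagged as outliers
upper_fence = q3 + 1.5 * iqr
outliers = toy[toy["total_sales"] > upper_fence]

print("Upper fence:", upper_fence)
print(outliers)
```

To apply the same rule per platform, the quartiles could be computed inside a `groupby('platform')` instead of over the whole column.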

#### Is there a correlation between user reviews and sales for the ps2, gb, and x360 platforms?
The Pearson correlation coefficient (r) will be used and can be interpreted as follows:
1. Perfect Positive Linear Relationship (r = 1): Variables move together in a perfect positive linear pattern.
2. Strong Positive Linear Relationship (0.7 ≤ r < 1): Variables move together strongly but not perfectly.
3. Moderate Positive Linear Relationship (0.3 ≤ r < 0.7): Variables move together moderately.
4. Weak Positive Linear Relationship (0 < r < 0.3): Variables move together weakly.
5. No Linear Relationship (r ≈ 0): No consistent linear pattern between variables.
6. Weak Negative Linear Relationship (-0.3 < r < 0): Variables move oppositely weakly.
7. Moderate Negative Linear Relationship (-0.7 < r ≤ -0.3): Variables move oppositely moderately.
8. Strong Negative Linear Relationship (-1 < r ≤ -0.7): Variables move oppositely strongly but not perfectly.
9. Perfect Negative Linear Relationship (r = -1): Variables move in exactly opposite directions in a perfect linear pattern.
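
The interpretation scale above can be expressed as a small helper. This is a sketch only: `describe_pearson` is a hypothetical function name, the negative thresholds are assumed to mirror the positive ones, and the `x` and `y` arrays are invented sample data.

```python
import numpy as np


def describe_pearson(r: float) -> str:
    """Map a Pearson coefficient to the verbal scale used in this notebook."""
    strength = abs(r)
    if np.isclose(strength, 0.0):
        return "no linear relationship"
    direction = "positive" if r > 0 else "negative"
    if np.isclose(strength, 1.0):
        level = "perfect"
    elif strength >= 0.7:
        level = "strong"
    elif strength >= 0.3:
        level = "moderate"
    else:
        level = "weak"
    return f"{level} {direction} linear relationship"


# Hypothetical paired observations, invented for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Pearson r is the off-diagonal entry of the 2x2 correlation matrix
r = np.corrcoef(x, y)[0, 1]

print(round(r, 4), "->", describe_pearson(r))
```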

#### PS2

In [21]:
# Create a data frame with only ps2 games 
df_ps2 = df_filtered.query("platform == 'ps2'")

# Calculate the correlation 
df_ps2['user_score'].corr(df_ps2['total_sales'])

0.1763636445969453

In [22]:
# Plot a dependency graph for ps2 game sales and user reviews
x = df_ps2['user_score']
y = df_ps2['total_sales']

# Create a scatter plot with Plotly Express
fig_ps2= px.scatter(x=x, y=y, title='Dependency Graph 1: PS2 User Ratings and Sales', labels={'x': 'User Rating', 'y': 'Total Sales (USD Million)'})

# Show the plot
fig_ps2.show()


Based on the correlation coefficient of approximately 0.176, it can be concluded that there is a weak positive linear correlation between user score and total sales for ps2 games.

#### GB

In [23]:
# Create a data frame with only gb games 
df_gb = df_filtered.query("platform == 'gb'")

# Calculate the correlation 
df_gb['user_score'].corr(df_gb['total_sales'])

0.2013915923672528

In [24]:
# Plot a dependency graph for gb game sales and user reviews
x = df_gb['user_score']
y = df_gb['total_sales']

# Create a scatter plot with Plotly Express
fig_gb= px.scatter(x=x, y=y, title='Dependency Graph 2: GB User Ratings and Sales', labels={'x': 'User Rating', 'y': 'Total Sales (USD Million)'})

# Show the plot
fig_gb.show()

Based on the correlation coefficient of approximately 0.201, it can be concluded that there is a weak positive linear correlation between user score and total sales for gb games.

#### X360

In [25]:
# Create a data frame with only x360 games 
df_x360 = df_filtered.query("platform == 'x360'")

# Calculate the correlation 
df_x360['user_score'].corr(df_x360['total_sales'])

0.0637637853717711

In [26]:
# Plot a dependency graph for x360 game sales and user reviews
x = df_x360['user_score']
y = df_x360['total_sales']

# Create a scatter plot with Plotly Express
fig_x360= px.scatter(x=x, y=y, title='Dependency Graph 3: X360 User Ratings and Sales', labels={'x': 'User Rating', 'y': 'Total Sales (USD Million)'})

# Show the plot
fig_x360.show()

Based on the correlation coefficient of approximately 0.064, it can be concluded that there is a very weak positive linear correlation between user score and total sales for x360 games.

#### How do the sales of Grand Theft Auto V on ps3 compare to sales of the same game on other platforms?

In [27]:
# Create a data frame only containing releases of the game grand theft auto v 
df_gta= df_filtered.query("name=='grand theft auto v'")

# View the data frame 
df_gta

| | name | platform | year_of_release | genre | na_sales | eu_sales | jp_sales | other_sales | critic_score | user_score | rating | total_sales |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 16 | grand theft auto v | ps3 | 2013 | action | 7.02 | 9.09 | 0.98 | 3.96 | 97.0 | 8.2 | m | 21.05 |
| 23 | grand theft auto v | x360 | 2013 | action | 9.66 | 5.14 | 0.06 | 1.41 | 97.0 | 8.1 | m | 16.27 |
| 42 | grand theft auto v | ps4 | 2014 | action | 3.96 | 6.31 | 0.38 | 1.97 | 97.0 | 8.3 | m | 12.62 |
| 165 | grand theft auto v | xone | 2014 | action | 2.81 | 2.19 | 0.0 | 0.47 | 97.0 | 7.9 | m | 5.47 |
| 1730 | grand theft auto v | pc | 2015 | action | 0.39 | 0.69 | 0.0 | 0.09 | 96.0 | 7.9 | m | 1.17 |


In [28]:
# Plot the total sales of gta v by platform
fig5 = px.histogram(df_gta,
                    x='platform',
                    y='total_sales',
                    title='Figure 5: Total Sales of Grand Theft Auto V by Platform',
                    color='platform')
fig5.update_layout(xaxis_title='Platform', yaxis_title='Total Sales (USD Million)')
fig5.show()

As can be observed in the above table and in Figure 5, Grand Theft Auto V had the highest total sales on the ps3 platform version followed by x360, ps4, xone, and pc in that order. It can also be noted that the game was first released on ps3 and x360 in 2013, then for ps4 and xone in 2014, then finally for pc in 2015. 

#### Which genres are the most profitable? Which are the least?

In [29]:
# Create a df with the sum of all sales grouped by genre
genre_sales = df_filtered.groupby('genre')['total_sales'].sum().reset_index()

# Plot the total sales by genre
fig6_1 = px.histogram(genre_sales,
                    x='genre',
                    y='total_sales',
                    title='Figure 6.1: Total Sales by Genre',
                    color='genre')
fig6_1.update_layout(xaxis_title='Genre', yaxis_title='Total Sales (USD Million)')
fig6_1.show()

In [30]:
# Plot the distribution of total sales of games by genre
fig6_2 = px.box(df_filtered, x="genre", y="total_sales", hover_name="name", title='Figure 6.2: Game Sale Variation By Genre')
fig6_2.update_layout(xaxis_title='Genre', yaxis_title='Total Sales (USD Million)',  yaxis=dict(range=[0, 1.75]))
fig6_2.show()

In Figure 6.1, it can be observed that action is the genre with the most total sales, at $1665.42 million. The sports genre follows with $1277.39 million in sales. Shooter and role-playing games come next ($981.59M and $915.83M). The genres with the lowest total sales are strategy with only $172.57M, puzzle with only $177.14M, and adventure with only $228.55M. In Figure 6.2, it can be observed that these totals are affected by outliers. The median sales of the various genres tell a slightly different story.

Genres with the highest median sales:
1. Platform 
2. Shooter
3. Sports
4. Fighting 
5. Role-playing, Racing, and Action (Same Median)

Genres with the lowest median sales:
1. Adventure
2. Puzzle
3. Strategy 
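
The gap between a ranking by total sales and a ranking by median sales can be reproduced on a small example. The sketch below uses a hypothetical `toy` table with invented figures, in which one blockbuster title inflates a genre's total while its typical game sells less.

```python
import pandas as pd

# Hypothetical sales (USD million); the values are invented for illustration only
toy = pd.DataFrame({
    "genre": ["action"] * 4 + ["platform"] * 3,
    "total_sales": [0.1, 0.1, 0.2, 20.0, 0.5, 0.6, 0.7],
})

# Total and median sales per genre, side by side
summary = toy.groupby("genre")["total_sales"].agg(["sum", "median"])

print(summary)
print("Highest total:", summary["sum"].idxmax())
print("Highest median:", summary["median"].idxmax())
```

In this toy data, action leads on total sales because of a single outlier, while platform has the higher median, which is the same kind of reversal seen between Figures 6.1 and 6.2.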

### Analysis by Sales Region
---
In this section, comparisons will be made between the three sales regions: North America, Europe, and Japan. The following factors will be analyzed:
- Platform market share variations by region
- Genre popularity variation across regions 
- Effect of ESRB rating by region

#### Platform Market Share Variations Across Regions

In [31]:
# Create a df with regional sales for each platform
df_market = df_filtered.groupby('platform')[['na_sales', 'eu_sales', 'jp_sales','other_sales']].sum().reset_index()

# Melt the DataFrame to long format for easier plotting with Plotly Express
df_melted = df_market.melt(id_vars='platform', var_name='region', value_name='sales')

# Plot using Plotly Express
fig7 = px.bar(df_melted, x='platform', y='sales', color='region', barmode='group',
             title='Figure 7: Sales by Region and Platform', labels={'sales': 'Sales (USD Million)', 'platform': 'Platform', 'region': 'Region'})

# Display the figure
fig7.show()

It can be observed that the most sales often come from the North American market, followed by the European market and then the Japanese market. This is likely a result of differences in population between these sales regions.

The 5 highest selling platforms in the North American market are (in order):
1. X360
2. PS2
3. Wii
4. PS3
5. DS

The 5 highest selling platforms in the European market are (in order):
1. PS2
2. PS3
3. X360
4. Wii
5. PS

The 5 highest selling platforms in the Japanese market are (in order):
1. DS
2. PS
3. PS2 
4. SNES
5. 3DS 

The 5 highest selling platforms in the other markets are (in order):
1. PS2
2. PS3
3. X360
4. Wii
5. DS 

It can be noted that portable devices perform better in the Japanese sales region. 
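
Because absolute sales differ considerably between regions, comparing each platform's share of its region's total can make regional preferences easier to see. The following is a minimal sketch on a hypothetical `toy` table with invented figures and only two regions. It is not the ranking computed from the dataset.

```python
import pandas as pd

# Hypothetical regional sales (USD million); values are invented for illustration
toy = pd.DataFrame({
    "platform": ["x360", "ps2", "ds"],
    "na_sales": [60.0, 50.0, 30.0],
    "jp_sales": [1.0, 13.0, 17.0],
}).set_index("platform")

# Divide each column by its own total so every region's shares sum to 1
share = toy / toy.sum()

# Top two platforms by market share in each region
top_na = share["na_sales"].nlargest(2).index.tolist()
top_jp = share["jp_sales"].nlargest(2).index.tolist()

print(share.round(3))
print("Top in NA:", top_na)
print("Top in JP:", top_jp)
```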

#### Genre Popularity Variations Across Regions

In [32]:
# Create a df with regional sales for each genre
df_market = df_filtered.groupby('genre')[['na_sales', 'eu_sales', 'jp_sales','other_sales']].sum().reset_index()

# Melt the DataFrame to long format for easier plotting
df_melted = df_market.melt(id_vars='genre', var_name='region', value_name='sales')

# Plot
fig8 = px.bar(df_melted, x='genre', y='sales', color='region', barmode='group',
             title='Figure 8: Sales by Region and Genre', labels={'sales': 'Sales (USD Million)', 'genre': 'Genre', 'region': 'Region'})

# Display the figure
fig8.show()

It can be observed that different game genres sell differently in the various regions. 

In the North American sales region, the most popular genres are action, sports, shooter, misc, and platform in that order. This for the most part aligns with the overall most sold genres, likely because this is the largest market. 

In the European sales region, the most popular genres are action, sports, shooter, racing, and misc in that order. 

In the Japanese sales region, the most sold genre by far is role-playing. With $340.71M in sales, the role-playing genre more than doubles the sales of the second highest selling genre, action ($151.83M in sales in the Japanese region). These two genres are followed by sports, platform, and fighting in that order.

#### ESRB Rating Effect by Region 

In [33]:
# Create a df with regional sales for each rating
df_rate = df_filtered.groupby('rating')[['na_sales', 'eu_sales', 'jp_sales','other_sales']].sum().reset_index()

# Melt the DataFrame to long format for easier plotting
df_melted = df_rate.melt(id_vars='rating', var_name='region', value_name='sales')

# Plot
fig9 = px.bar(df_melted, x='rating', y='sales', color='region', barmode='group',
             title='Figure 9: Sales by Region and Rating', labels={'sales': 'Sales (USD Million)', 'rating': 'Rating', 'region': 'Region'})

# Display the figure
fig9.show()

In [34]:
# Plot the total number of games released per rating
fig10 = px.histogram(df_filtered,
                    x='rating',
                    title='Figure 10: Number of Games Released per Rating')
fig10.update_layout(xaxis_title='Game Rating', yaxis_title='Number of Games Released')
fig10.show()

In Figure 10, it can be observed that the most common ratings are unrated, e, t, m, and e10+, in that order. The other four ratings are rarely used and therefore have negligible sales.
In the North American sales region, the rating with the most sales is e, with $1274.24M in sales, followed by unrated, with $990.18M in sales, then t. In the European region, the highest-selling ratings are e, unrated, and m, in that order. In the Japanese region, unrated games far outsold all other ratings, with $734.25M in sales, followed by e, with $197.96M, and t, with $150.7M.

### Hypothesis Testing 
---
This section will utilize hypothesis testing to see if the following two hypotheses are supported through statistical analysis:
1. Average user ratings of the Xbox One and PC platforms are the same. 
2. Average user ratings for the Action and Sports genres are different.

To do this, an independent samples t-test will be carried out for each question/hypothesis. A t-test is a statistical method used to determine if there is a significant difference between the means of two groups. This test looks at both the mean and variability of each data set to determine the t-value, which is the difference between the means of the two groups relative to the variation within the groups. The t-value is then used to determine the p-value. The p-value represents the probability of observing the data (or more extreme results) if the null hypothesis were true. This p-value is then compared to the alpha value, a decimal number that represents the accepted probability of incorrectly rejecting the null hypothesis when it is actually true. In the following testing, an alpha value of 0.05 will be used. If the p-value is less than alpha, the null hypothesis is rejected. Otherwise, the null hypothesis cannot be rejected.
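
The relationship between the t-value and the group statistics described above can be checked by hand on a small example. The sketch below uses two hypothetical samples with invented scores. It computes the pooled-variance t-value manually and compares it with `scipy.stats.ttest_ind`. The `equal_var=False` call (Welch's test) is shown only as an option for groups with unequal variances, and is not the setting used in this notebook.

```python
import numpy as np
from scipy import stats as st

# Hypothetical user scores for two groups, invented for illustration only
group_a = np.array([8.0, 7.5, 9.0, 8.5, 7.0, 8.0])
group_b = np.array([6.0, 6.5, 5.5, 7.0, 6.0, 5.0])
alpha = 0.05

# Manual t-value: difference in means relative to the pooled within-group variation
n_a, n_b = len(group_a), len(group_b)
pooled_var = (
    (n_a - 1) * group_a.var(ddof=1) + (n_b - 1) * group_b.var(ddof=1)
) / (n_a + n_b - 2)
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Student's t-test (equal variances assumed, the SciPy default)
student = st.ttest_ind(group_a, group_b)

# Welch's t-test, which does not assume equal variances
welch = st.ttest_ind(group_a, group_b, equal_var=False)

print("Manual t-value:", t_manual)
print("SciPy t-value: ", student.statistic)
print("Student p-value:", student.pvalue)
print("Welch p-value:  ", welch.pvalue)
print("Reject null" if student.pvalue < alpha else "Cannot reject null")
```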

#### Are the average user ratings of the Xbox One and PC platforms the same?
- Null Hypothesis: There is not a statistically significant difference between the average user ratings of the Xbox One and PC platforms.
- Alternative Hypothesis: There is a statistically significant difference between the average user ratings of the Xbox One and PC platforms.

In [35]:
# Isolate the user ratings for each platform
ur_xone = df_filtered.query("platform =='xone'")['user_score'].dropna().reset_index(drop=True)
ur_pc = df_filtered.query("platform =='pc'")['user_score'].dropna().reset_index(drop=True)

alpha = 0.05  # critical statistical significance level

# Perform t-test
results = st.ttest_ind(ur_xone, ur_pc)

print('p-value: ', results.pvalue)

if results.pvalue < alpha:
    print("We reject the null hypothesis")
else:
    print("We can't reject the null hypothesis")


p-value:  2.5642178004769378e-05
We reject the null hypothesis


This t-test returned a p-value of approximately 2.564e-5, which is far smaller than the alpha value of 0.05, so the null hypothesis is rejected. In other words, hypothesis testing does not support the theory that the average user ratings of the Xbox One and PC platforms are the same.

#### Are the average user ratings for the Action and Sports genres different?
- Null Hypothesis: There is not a statistically significant difference between the average user ratings for the Action and Sports genres.
- Alternative Hypothesis: There is a statistically significant difference between the average user ratings for the Action and Sports genres.



In [36]:
# Isolate the user ratings for each genre
ur_action = df_filtered.query("genre =='action'")['user_score'].dropna().reset_index(drop=True)
ur_sports = df_filtered.query("genre =='sports'")['user_score'].dropna().reset_index(drop=True)

alpha = 0.05  # critical statistical significance level

# Perform t-test
results = st.ttest_ind(ur_action, ur_sports)

print('p-value: ', results.pvalue)

if results.pvalue < alpha:
    print("We reject the null hypothesis")
else:
    print("We can't reject the null hypothesis")

p-value:  0.42352830215328674
We can't reject the null hypothesis


This t-test returned a p-value of approximately 0.4235, which is larger than the alpha value of 0.05, so the null hypothesis cannot be rejected. In other words, hypothesis testing does not support the theory that the average user ratings for the action and sports genres are different by a statistically significant amount.

## Conclusions 
---

In conclusion, the data shows that game console popularity tends to rise and fall within a span of 10 years, with the PC being the most enduring platform with periodic sales spikes. Based on the insights gleaned, it is advisable to stock games from popular game franchises for newer (less than 5 years old) consoles that are beginning or currently on an upward sales trajectory, especially for the console the game edition is first released on.

When considering game franchises, Mario Kart, Pokemon, Super Mario, Call of Duty, and Grand Theft Auto are among those with the most impressive sales records. It's important to note that the highest-grossing game for each platform is often part of a successful, recognizable franchise.

The games Wii Sports, Wii Fit (multiple editions), Mario Kart Wii, and Super Mario Wii (multiple editions) were significant outliers in our dataset, with total sales ranging from approximately $20M to $82M. This suggests that if a console comes out with a new, unique play style, it would be advisable to stock up on games released for that console from popular game franchises, as well as any early games that lean into the unique play style.

Interestingly, there isn't a strong correlation between user reviews and total sales, indicating that other factors such as marketing, brand recognition, and game play might play a more significant role in sales performance.

Additionally, it may be worthwhile to explore games in the action and sports genres, which tend to perform well across different regions. However, regional preferences should also be considered, particularly in the Japanese market, where role-playing games and portable devices are more popular.