# Game Blockbuster Potential: Analyzing the Success Rate of Action Games Across Gaming Platforms

### Contribution Report

<i>1. Johnny Powell completed parts 1, 2, and 4</i>

<i>2. Madisen LeShoure completed parts 1 and 4</i>

<i>3. Jeevan Rathakrishnan completed parts 3 and 4</i>

# 1. Introduction

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import math
import numpy as np
df = pd.read_csv('video_games.csv', encoding='unicode_escape')

In [2]:
df.columns

Index(['ï»¿Console', 'Title', 'US Sales (millions)', 'Block4', 'Block2',
       'Block1', 'Block0.5', 'YearReleased', '2004', '2005',
       ...
       'Tricks', 'Volleyball', 'Wakeboarding', 'Wrestling', 'FirstPerson',
       'Platform', 'Isometric', 'SideScrolling', 'TopDown', 'ThirdPerson'],
      dtype='object', length=166)

Our dataset comes from the University of Portsmouth, where it was collected and curated by Dr. Joe Cox. It focuses on video games and their statistics, containing data about the sales, gameplay characteristics, and playtime of over a thousand video games released between 2004 and 2010. The original dataset was used to provide insights into the factors that contribute to a video game becoming a blockbuster. Playtime data was collected via crowdsourcing. The dataset contains 1770 instances (rows) of games and 166 columns; the columns record the characteristics of each video game and the blockbuster (sales) tier each game achieved, if any. Our dataset represents a sample rather than a complete population. Because the video game industry is vast and constantly evolving, it would be very challenging to collect data on every game ever made, and our dataset only covers a specific time frame (2004 - 2010). Our population of interest is all action games released between 2004 and 2010 across multiple gaming platforms. Because our dataset is only a sample of video games, ideally the action games we analyze would be a random sample of that population, so that our results are representative, generalizable, and less prone to bias.

For our analysis we would like to examine the characteristics that contribute to a video game achieving blockbuster status. To do so we aim to answer these questions: Is the proportion of action games that achieve Block0.5 status significantly less than 20%? And how do the mean and median sales of action games that achieve blockbuster (Block0.5) status compare with those of action games that do not?

Developing a successful game can be extremely difficult. The gaming industry is very competitive and evolves quickly, so it is important for developers, publishers, platform holders, and investors to understand the factors that contribute to a game's success. Action games are one of the most popular gaming genres and represent a large portion of the gaming market. The answers to our research questions could provide data-driven insights that inform decision-making across the industry. Such insights could influence which characteristics are incorporated into games, shape investment strategies for action games, show how different game characteristics perform on specific consoles, and more generally suggest how to combine these factors to create more successful action games.

Null Hypothesis (Ho): The proportion of games in the 'Action' genre that achieve "Block0.5" status is equal to 20% (p = 0.2). 


Alternative Hypothesis (Ha): The proportion of games in the 'Action' genre that achieve "Block0.5" status is less than 20% (p < 0.2).

## Confidence Interval Analytical Tasks

This part focuses on the proportion of action games that achieve Block0.5 status, compared with a hypothesized population proportion of 20%.

## Research Question
Is the proportion of successful games (Block0.5 == True) significantly less than 20%?

## Dataset cleaning 
I used the function from our last project to convert every column that records only 0 or 1 to Boolean. We did not need to remove any rows, as the dataset contains no missing or null values. I then restricted our sample to action games and split it by Block0.5 (success) status.

In [3]:
def make_boolean(var, true_condition):
    # Element-wise comparison: True wherever the value equals true_condition
    return (var == true_condition)

# Convert every column whose only values are 0 and 1 to Boolean
for col in df.columns:
    if set(df[col].unique()) == {0, 1}:
        df[col] = make_boolean(df[col], 1)

# Total count of missing values across the whole DataFrame
print(f"{df.isnull().sum().sum()} missing values")

0 missing values


In [4]:
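# Subset to action games; all later statistics use this subset
action_games = df[df["Action"] == True]
# True/False Block0.5 indicator for each action game
success_block = action_games['Block0.5']
# US sales (millions), split by whether the action game reached Block0.5 status
blockbuster = action_games[action_games['Block0.5'] == True]['US Sales (millions)']
non_blockbuster = action_games[action_games['Block0.5'] == False]['US Sales (millions)']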

## Descriptive Analytics

The dataset contains 1,076 action games (our sample size), and the proportion of Block0.5 action games is 0.2388. In other words, around 23.88% of the action games were considered at least mildly successful.

In [5]:
int((df['Block0.5'] == 1).sum())  # Block0.5 games across all genres, not only action games

382

In [6]:
float((df['Block0.5'] == 1).sum() / len(df))  # share of all 1,770 games with Block0.5 status

0.21581920903954802

In [7]:
sample_size = len(success_block)
sample_proportion = success_block.mean()
print(f"Sample Size: {sample_size}")
print(f"Proportion of successful action games: {sample_proportion:.4f}")

Sample Size: 1076
Proportion of successful action games: 0.2388


In [8]:
sample_size_blockbuster = len(blockbuster)
sample_mean_blockbuster = np.mean(blockbuster)
print(f"Sample Size for Blockbuster Games: {sample_size_blockbuster}")
print(f"Mean Sale Amount for Blockbuster Games: {sample_mean_blockbuster:.2f}")
sample_size_non_blockbuster = len(non_blockbuster)
sample_mean_non_blockbuster = np.mean(non_blockbuster)
print(f"Sample Size for Non-Blockbuster Games: {sample_size_non_blockbuster}")
print(f"Mean Sale Amount for Non-Blockbuster Games: {sample_mean_non_blockbuster:.2f}")

Sample Size for Blockbuster Games: 257
Mean Sale Amount for Blockbuster Games: 1.63
Sample Size for Non-Blockbuster Games: 819
Mean Sale Amount for Non-Blockbuster Games: 0.17


In [9]:
simulated_means = []
for i in range(5000):
    sample = success_block.sample(sample_size, replace = True)
    simulated_means.append(sample.mean())
simulated_means = pd.DataFrame({"x": simulated_means})
simulated_means

             x
0     0.243494
1     0.226766
2     0.242565
3     0.233271
4     0.276022
...        ...
4995  0.250000
4996  0.233271
4997  0.235130
4998  0.229554


In [10]:
simulated_means_blockbuster = []
for i in range(5000):
    sample = blockbuster.sample(sample_size_blockbuster, replace=True)
    simulated_means_blockbuster.append(sample.mean())
simulated_means_blockbuster = pd.DataFrame({"mean_sale_amount": simulated_means_blockbuster})
simulated_means_non_blockbuster = []
for i in range(5000):
    sample = non_blockbuster.sample(sample_size_non_blockbuster, replace=True)
    simulated_means_non_blockbuster.append(sample.mean())
simulated_means_non_blockbuster = pd.DataFrame({"mean_sale_amount": simulated_means_non_blockbuster})

In [11]:
simulated_medians_blockbuster = []
for i in range(5000):
    sample = blockbuster.sample(sample_size_blockbuster, replace=True)
    simulated_medians_blockbuster.append(sample.median())
simulated_medians_blockbuster = pd.DataFrame({"median_sale_amount": simulated_medians_blockbuster})
simulated_medians_non_blockbuster = []
for i in range(5000):
    sample = non_blockbuster.sample(sample_size_non_blockbuster, replace=True)
    simulated_medians_non_blockbuster.append(sample.median())
simulated_medians_non_blockbuster = pd.DataFrame({"median_sale_amount": simulated_medians_non_blockbuster})

In [12]:
# simulated_means holds bootstrap proportions (means of True/False values)
lower_bound = np.quantile(simulated_means["x"], 0.05)
upper_bound = np.quantile(simulated_means["x"], 0.95)
print(f"90% Confidence Interval for Proportion: [{lower_bound:.4f}, {upper_bound:.4f}]")

90% Confidence Interval for Proportion: [0.2175, 0.2593]


In [13]:
lower_bound_blockbuster = np.quantile(simulated_means_blockbuster, 0.05)
upper_bound_blockbuster = np.quantile(simulated_means_blockbuster, 0.95)
print(f"90% Confidence Interval for Blockbuster Mean Sale Amount: [{lower_bound_blockbuster:.2f}, {upper_bound_blockbuster:.2f}]")
lower_bound_non_blockbuster = np.quantile(simulated_means_non_blockbuster, 0.05)
upper_bound_non_blockbuster = np.quantile(simulated_means_non_blockbuster, 0.95)
print(f"90% Confidence Interval for Non-Blockbuster Mean Sale Amount: [{lower_bound_non_blockbuster:.2f}, {upper_bound_non_blockbuster:.2f}]")

90% Confidence Interval for Blockbuster Mean Sale Amount: [1.44, 1.83]
90% Confidence Interval for Non-Blockbuster Mean Sale Amount: [0.17, 0.18]
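
The notebook simulates medians as well as means, and the same percentile step applies to both. As a minimal sketch, the resampling loops above could be wrapped in one helper. The function name `bootstrap_ci`, its parameters, and the `toy_sales` data are hypothetical and are not part of the original notebook or dataset.

```python
import numpy as np

def bootstrap_ci(values, stat=np.mean, n_boot=5000, level=0.90, seed=0):
    """Percentile bootstrap interval for `stat` of a 1-D sample (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    # Draw all resamples at once: one row per bootstrap replicate,
    # each row the same size as the original sample.
    resamples = rng.choice(values, size=(n_boot, values.size), replace=True)
    replicates = stat(resamples, axis=1)
    alpha = 1 - level
    lower = np.quantile(replicates, alpha / 2)
    upper = np.quantile(replicates, 1 - alpha / 2)
    return lower, upper

# Hypothetical right-skewed sales figures (millions); NOT the real dataset.
toy_sales = np.random.default_rng(1).lognormal(mean=0.0, sigma=0.8, size=257)

mean_low, mean_high = bootstrap_ci(toy_sales, stat=np.mean)
median_low, median_high = bootstrap_ci(toy_sales, stat=np.median)
print(f"90% interval for the mean:   [{mean_low:.2f}, {mean_high:.2f}]")
print(f"90% interval for the median: [{median_low:.2f}, {median_high:.2f}]")
```

With the real data, passing `blockbuster` or `non_blockbuster` in place of `toy_sales` would give intervals comparable to the ones printed above, up to resampling noise.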


In [14]:
if upper_bound < 0.2:
    print("The proportion of successful action games is significantly less than 20%.")
else:
    print("The proportion of successful action games is NOT significantly less than 20%.")

The proportion of successful action games is NOT significantly less than 20%.
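
As a cross-check on the bootstrap percentile interval, a normal-approximation (Wald) interval can be computed directly from the counts reported above (257 Block0.5 games out of 1076 action games). This is a sketch of a different, formula-based technique and not the method used in the notebook; the critical value 1.645 is the usual approximation for a two-sided 90% interval.

```python
import math

n = 1076          # number of action games (from the notebook output)
successes = 257   # action games with Block0.5 status (from the notebook output)
z_90 = 1.645      # approximate standard normal critical value, two-sided 90%

p_hat = successes / n
# Standard error of a sample proportion under the normal approximation
standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
wald_lower = p_hat - z_90 * standard_error
wald_upper = p_hat + z_90 * standard_error

print(f"Sample proportion: {p_hat:.4f}")
print(f"90% normal-approximation interval: [{wald_lower:.4f}, {wald_upper:.4f}]")
```

If the two approaches agree closely, that supports treating the sampling distribution of the proportion as approximately normal.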


## Hypothesis Testing Analytical Tasks

Null Hypothesis (Ho): The proportion of games in the 'Action' genre that achieve "Block0.5" status is equal to 20% (p = 0.2). 

Alternative Hypothesis (Ha): The proportion of games in the 'Action' genre that achieve "Block0.5" status is less than 20% (p < 0.2).

For our hypothesis test, we would like to know: Is the proportion of action games that achieve Block0.5 status less than 20%? And how do the mean and median sales of action games that achieve Block0.5 status compare with those of action games that do not achieve blockbuster status?

To begin, we performed some data cleaning. Because the dataset contains a large number of 0/1 indicator columns, we created a function that converts them to True/False values. We then checked our variables of interest ("Action", "Block0.5", and "US Sales (millions)") for null and missing values (.isna().sum()) and inspected their unique values (.unique()). We also checked the sales column for zero and negative values. No unusual, missing, or null values were found in these columns, so data quality places no limitation on the analysis.

In [15]:
int(df['US Sales (millions)'].isna().sum())

0

In [16]:
int((df['US Sales (millions)'] <= 0).sum())

0

In [17]:
int(df['Action'].isna().sum())

0

In [18]:
df['Action'].unique()

array([ True, False])

In [19]:
int(df['Block0.5'].isna().sum())

0

In [20]:
df['Block0.5'].unique()

array([ True, False])

## Descriptive Analytics

To observe the features of our variables of interest, we used .describe() on the 'Action' and 'Block0.5' columns. For each column, the summary gives the number of instances, the number of unique values, the most frequent value, and how often that value occurs.

We then subset our data to include only 'Action' games and used mean() and median() to summarize the sales variable. For all Action games, the average sales are 0.52 million and the median sales are 0.2 million. For Action games that achieved Block0.5 status, the average sales are 1.63 million and the median sales are 1.06 million.
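
The group summaries described here can also be produced in one step with `groupby`, which additionally reports the values for games that did not reach Block0.5 status. The sketch below uses a small hypothetical DataFrame named `toy`; its rows are illustrative only and are not taken from video_games.csv.

```python
import pandas as pd

# Hypothetical toy rows; NOT values from video_games.csv.
toy = pd.DataFrame({
    "Block0.5": [True, True, True, False, False, False, False],
    "US Sales (millions)": [0.6, 1.0, 3.2, 0.1, 0.2, 0.2, 0.4],
})

# One row per Block0.5 group, with the group size, mean, and median of sales.
summary = (
    toy.groupby("Block0.5")["US Sales (millions)"]
       .agg(["count", "mean", "median"])
)
print(summary)
```

Applied to `action_games` instead of `toy`, the same call would give both groups' means and medians side by side.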

In [21]:
df['Block0.5'].describe()

count      1770
unique        2
top       False
freq       1388
Name: Block0.5, dtype: object

In [22]:
df['Action'].describe()

count     1770
unique       2
top       True
freq      1076
Name: Action, dtype: object

In [23]:
action_games = df[df['Action']== True] #Filtering data for Action games
action_games.shape

(1076, 166)

In [24]:
print(action_games['US Sales (millions)'].mean())           #Average sales for all Action games
print(action_games['US Sales (millions)'].median())         #Median sales for all Action games

0.5197026022304834
0.2


In [25]:
block_interest = action_games[action_games['Block0.5'] == True]    #Filtering data for Action games that achieved block0.5 status   

In [26]:
print(block_interest['US Sales (millions)'].mean())           #Average sales of Action block0.5 games 
print(block_interest['US Sales (millions)'].median())

1.6250194552529182
1.06


In [27]:
obs_proportion = float(action_games['Block0.5'].mean() )   #Observed proportion of Action games with Block0.5 status.
obs_proportion

0.23884758364312267

## Hypothesis Test

Our parameter of interest is the proportion of 'Action' games with block0.5 status.

Null Hypothesis (Ho): The proportion of games in the 'Action' genre that achieve "Block0.5" status is equal to 20% (p = 0.2). 

Alternative Hypothesis (Ha): The proportion of games in the 'Action' genre that achieve "Block0.5" status is less than 20% (p < 0.2).

Significance level: 0.10

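Checking conditions: n * p = 1076 * 0.20 = 215.2 >= 10 and n * (1 - p) = 1076 * 0.80 = 860.8 >= 10, so the sample is large enough for the sampling distribution of the proportion to be approximately normal.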
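
The condition check can be reproduced in a few lines, together with a normal-approximation z-test as a rough cross-check on the simulation-based test. This is a sketch using only the counts reported in the notebook (257 of 1076); the z-test is a swapped-in formula-based technique, not the bootstrap method the notebook uses, and the helper `normal_cdf` is defined here for illustration.

```python
import math

n = 1076          # action games in the sample (from the notebook)
successes = 257   # action games with Block0.5 status (from the notebook)
p0 = 0.20         # proportion stated in the null hypothesis

# Success-failure condition: expected counts under H0 should both be >= 10.
expected_successes = n * p0
expected_failures = n * (1 - p0)
conditions_met = expected_successes >= 10 and expected_failures >= 10

# One-proportion z statistic, using the standard error implied by H0.
p_hat = successes / n
se_null = math.sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / se_null

def normal_cdf(x):
    """Standard normal CDF via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2))

p_value_left = normal_cdf(z)        # for Ha: p < p0
p_value_right = 1 - p_value_left    # for Ha: p > p0

print(f"Conditions met: {conditions_met}")
print(f"z = {z:.2f}, left-tail p = {p_value_left:.4f}, right-tail p = {p_value_right:.4f}")
```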

Our 90% bootstrap confidence interval indicates that the true proportion of action games with Block0.5 status lies between about 0.22 and 0.26. The histogram of the bootstrap sampling distribution shows its shape and that it is centered on our observed proportion of 0.24. Because a sample proportion is the mean of True/False values, the mean is the appropriate measure of center, and the roughly normal shape of the distribution supports that choice.

The value 0.0022 computed in the final cell is an upper-tail probability: assuming the true proportion is 20%, it is the share of simulated samples whose proportion is at least as large as the observed 0.24. That tail corresponds to the alternative p > 0.2, not to our stated alternative p < 0.2. For the stated left-tailed alternative, the p-value is the share of simulated proportions at or below 0.24, which is roughly 1 - 0.0022 ≈ 0.998. Since this is far above the 0.10 significance level, we fail to reject the null hypothesis: there is no evidence that fewer than 20% of action games achieve Block0.5 status, which agrees with the confidence interval analysis.

The significance level, set at 0.10, is the threshold against which we compare the p-value to decide whether a result is statistically significant. The small upper-tail probability of 0.0022 does show that a sample proportion as high as ours would be very unlikely if the true proportion were only 20%, so the data point in the opposite direction of our alternative: they suggest that more than 20% of action games achieve Block0.5 status. A right-tailed test (Ha: p > 0.2) would reject the null hypothesis at the 0.10 level.
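
Which tail of the null distribution gives the p-value depends on the direction of the alternative hypothesis. The sketch below rebuilds the True/False outcomes from the counts reported in the notebook (257 of 1076), shifts them so that the null hypothesis holds, and computes both tails. The variable names and the random seed are assumptions made for illustration, so the exact values will differ slightly from the notebook's run.

```python
import numpy as np

n, successes, p0 = 1076, 257, 0.20   # counts from the notebook; p0 from H0

# Rebuild the 0/1 outcomes for the action games from the reported counts.
outcomes = np.array([1.0] * successes + [0.0] * (n - successes))
obs_proportion = outcomes.mean()

# Shift the data so its mean equals p0, i.e. simulate a world where H0 is true.
null_data = outcomes - obs_proportion + p0

rng = np.random.default_rng(42)      # arbitrary seed, for reproducibility only
# 5000 bootstrap samples of size n, one per row; each row mean is a simulated proportion.
null_props = rng.choice(null_data, size=(5000, n), replace=True).mean(axis=1)

# Left tail matches Ha: p < p0; right tail matches Ha: p > p0.
p_value_left = (null_props <= obs_proportion).mean()
p_value_right = (null_props >= obs_proportion).mean()

print(f"Left-tail p-value  (Ha: p < {p0}): {p_value_left:.4f}")
print(f"Right-tail p-value (Ha: p > {p0}): {p_value_right:.4f}")
```

Because the observed proportion sits above 0.20, the left-tail value comes out close to 1 and the right-tail value close to 0.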

In [None]:
sampling_dist = []
for i in range(5000):
    sample = action_games.sample(1076, replace = True)
    sampling_dist.append(sample['Block0.5'].mean())
sampling_dist = pd.DataFrame({'xbar':sampling_dist})
sampling_dist

In [None]:
print(sampling_dist['xbar'].quantile(0.05))
print(sampling_dist['xbar'].quantile(0.95))

In [None]:
sampling_dist.plot.hist() #Bootstrap distribution of the proportion of Action games with block0.5 status.
plt.title("Proportion of Action games with block0.5 status")
plt.xlabel("Proportion of Action games with Block0.5 status (bootstrap samples)")
plt.ylabel("Frequency")

In [None]:
mod_data = action_games['Block0.5'] - action_games['Block0.5'].mean() + 0.20 #modifying data to assume the null hypothesis is true.
float(mod_data.mean())

In [None]:
sampling_dist2 = []
for i in range(5000):
    sample = mod_data.sample(1076, replace = True)
    sampling_dist2.append(sample.mean())
sampling_dist2 = pd.DataFrame({'xbar':sampling_dist2})
sampling_dist2

In [None]:
float((sampling_dist2['xbar'] >= obs_proportion).mean()) #Upper-tail probability: share of simulated proportions under the null that are at least as large as the observed proportion.

# Conclusion

Our results indicate that the proportion of successful action games (Block0.5 == True) is not significantly less than 20%; the 90% confidence interval for the true proportion is [0.2193, 0.2602], which lies entirely above 0.20. We also examined the mean and median sales of action games by blockbuster status. Action games that achieved Block0.5 status had mean sales of 1.625 million and median sales of 1.060 million, compared with mean sales of 0.17 million for action games that did not achieve Block0.5 status; the 90% bootstrap intervals for the two means ([1.44, 1.83] and [0.17, 0.18]) do not overlap. This highlights the association between financial success and achieving blockbuster status in the gaming industry.

For our hypothesis test, we aimed to determine whether the proportion of action games achieving Block0.5 status was less than 20%. We first cleaned the data: we converted 0/1 columns to Boolean values and checked for missing and unusual values. Using a sample of 1076 action games, we calculated an observed proportion of 0.24 and tested at a significance level of 0.10. Because the observed proportion is above 20%, the test does not support the alternative p < 0.2, and we fail to reject the null hypothesis in that direction. The simulated upper-tail probability of 0.0022 instead points the other way: a proportion as high as 0.24 would be very unlikely if the true proportion were 20%, which is evidence that the proportion of action games achieving Block0.5 status is greater than 20%. Additionally, action games with Block0.5 status had higher average and median sales (1.63 and 1.06 million) than action games overall (0.52 and 0.2 million).

We encountered some limitations during our analysis. While our dataset was comprehensive, it may not capture all factors influencing blockbuster status, such as marketing or player demographics. Relying on sales as the measure of success might overlook other aspects of game performance, such as critical acclaim or player engagement. Furthermore, the dataset only includes specific consoles, so it does not fully represent the entire gaming market. These limitations are important to keep in mind when generalizing our results.

Future research could explore more questions and conduct further analyses to build on our findings. One idea is to examine the impact of specific game features or genres on achieving blockbuster status in order to identify predictors of success more closely. In addition, expanding the dataset to include more diverse consoles and international games could enhance the generalizability of the results and offer a more precise understanding of what determines blockbuster success in the gaming industry.