In [None]:
import matplotlib.pyplot as plt          #importing some packages
import numpy as np
import pandas as pd
import seaborn as sns

vgsales=pd.read_csv('Datasets/vgsales.csv', index_col=0)     #read in the csv
print(vgsales.head()) #We'll get a preliminary look at the data

Wii Sports has more than double the global sales of the next highest entry.
Fun fact: this is because Wii Sports came bundled with almost every Wii sold, and the Wii is one of the best-selling
game consoles of all time, beaten only by the PlayStation and PlayStation 2 (not counting handhelds). With that advantage
over the other games, I consider Wii Sports an anomaly and will remove it.

In [None]:
vgsales=vgsales[1:] #Remove the first entry of the zero-indexed DataFrame.
#Conveniently this data is ordered from highest global sales to lowest global sales.
print(vgsales.head())
print(vgsales.tail())

With the anomaly removed, let's get a first look at how the data are distributed. A histogram is a good way to get a quick look
at the dataset's distribution.

In [None]:
_=plt.hist(vgsales['Global_Sales'])
_=plt.xlabel('Global Sales')
_=plt.ylabel('Number of Games')
plt.show()

Most games on this list sold fewer than 5 million copies. It may be more interesting to view the data for the more successful games, so let's look at how sales are distributed in this DF to see what we can count as successful. With this many data points, an ECDF is a more appropriate way to show the distribution than a swarm plot or scatter plot. Histograms also suffer from binning bias: the picture changes with the choice of bins. An ECDF shows every point in a clearly visible manner.
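
To make the binning-bias point concrete, here is a minimal sketch. It does not use the real file: `sales` is a made-up right-skewed sample standing in for `vgsales['Global_Sales']`, and the lognormal parameters are arbitrary assumptions chosen only to give a similar bottom-heavy shape.

```python
import numpy as np

# Hypothetical stand-in for vgsales['Global_Sales'] (values in millions).
rng = np.random.default_rng(0)
sales = rng.lognormal(mean=-1.0, sigma=1.2, size=5000)

# The same data binned two ways gives two different pictures of the low end.
counts_10, edges_10 = np.histogram(sales, bins=10)
counts_50, edges_50 = np.histogram(sales, bins=50)
print("share of games in the first of 10 bins:", counts_10[0] / sales.size)
print("share of games in the first of 50 bins:", counts_50[0] / sales.size)

# The median does not depend on any binning choice.
median_sales = np.median(sales)
print("median sales (millions):", median_sales)
```

The first-bin share shrinks as the bins get narrower, even though the data are unchanged. That is the effect an ECDF avoids.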

In [None]:
def ecdf(series):
    """Compute ECDF for a series of data"""
    points=len(series)
    x=np.sort(series)
    y=np.arange(1.0, points+1.0) / points
    return x, y
#This function generates x and y values which we can then plot to get our ECDF
x_global, y_global=ecdf(vgsales['Global_Sales'])
_=plt.plot(x_global, y_global, marker='.', linestyle='none')
_=plt.xlabel('Global Sales (Million)')
_=plt.ylabel('ECDF')
plt.show()

The sales are extremely bottom-heavy, with more than 80% of games selling fewer than 2.5 million copies. This graph suggests an idea for a hypothesis test: let's see which genres of games have the greatest probability of being "successful". We can split the data into one DF per genre, then view their respective ECDFs to see how bottom-heavy each is. Then we can perform a hypothesis test once we decide on our criterion for success in sales.

In [None]:
#Get a list of genres to use as keys for our dictionary.
genre_names=vgsales['Genre'].unique().tolist()
#Build our dictionary of genre-separated DFs.
genre_dict={name: vgsales.loc[vgsales['Genre']==name] for name in genre_names}
genre_dict['Action'].info() #Making sure it worked. info() prints its summary itself and returns None.
#Now let's make a loop to create an ECDF for each genre DF.
for name in genre_dict:
    x_genre, y_genre=ecdf(genre_dict[name]['Global_Sales'])
    _=plt.plot(x_genre, y_genre, marker='.', linestyle='none')
    _=plt.xlabel(name + ' Genre Global Sales (Million)')
    _=plt.ylabel('ECDF')
    plt.show()

At first look, the genres Shooter, Fighting, and Strategy seem to have less steep slopes on their ECDFs than the others. However, we must pay close attention to the x-axis of these plots. The scale differs from plot to plot: the first tick sits at 2 million on some plots and at 1 million on others. So even though these genres' ECDFs look less steep, they likely have a similar percentage of entries beneath our marker of success to the other genres. Let's take 1 million to be our marker of success.

In [None]:
def over_1mil(series):
    """Calculates the percentage of values at or above 1 million in a series
    where values are in millions"""
    return float(len(series[series >= 1.000]))/float(len(series))*100.000
print(over_1mil(genre_dict['Action']['Global_Sales']))#Testing. It works!
#Let's loop through each genre and get a list of their over 1mil percentages
for name in genre_dict:
    percentage=over_1mil(genre_dict[name]['Global_Sales'])
    print(name + ": " + str(round(percentage, 2)) + "% of games sold over 1mil"
    + " copies out of " + str(len(genre_dict[name])) + " total games")
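
The same per-genre summary can also be produced without building a dictionary of DataFrames. The sketch below uses a pandas `groupby` on a tiny made-up table: `toy` and its values are hypothetical, and only the column names `Genre` and `Global_Sales` are taken from the dataset.

```python
import pandas as pd

# Hypothetical miniature of vgsales: same column names, made-up values (millions).
toy = pd.DataFrame({
    "Genre": ["Action", "Action", "Action", "Shooter", "Shooter", "Puzzle"],
    "Global_Sales": [0.4, 1.0, 2.5, 3.1, 0.2, 0.7],
})

# Flag each game that sold at least 1 million copies.
at_least_1mil = toy["Global_Sales"] >= 1.0

# Per genre: the mean of the flag is the share of hits, and size is the number of games.
summary = toy.assign(hit=at_least_1mil).groupby("Genre")["hit"].agg(["mean", "size"])
summary["percent_over_1mil"] = 100.0 * summary["mean"]
print(summary[["percent_over_1mil", "size"]])
```

Like `over_1mil()`, this counts a game with exactly 1 million sales as a success.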

It may be interesting to see if there is a correlation between the number of games in a genre and the percentage that sell over 1 million. To the plots!

In [None]:
for name in genre_dict:
    x_perc=over_1mil(genre_dict[name]['Global_Sales'])
    y_games=len(genre_dict[name])
    _=plt.plot(x_perc, y_games, marker='.', linestyle='none')
_=plt.xlabel('Percentage of Games with Sales over 1 Million')
_=plt.ylabel('Number of Games in Genre')
plt.show()
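
A visual check like this can be complemented with a single number, the Pearson correlation coefficient. The sketch below shows the calculation only: both arrays are hypothetical placeholders, not values read from the dataset, and with only a dozen genres any such coefficient should be treated as a rough indication.

```python
import numpy as np

# Hypothetical per-genre values, NOT taken from vgsales.
percent_over_1mil = np.array([12.0, 9.5, 15.0, 7.0, 11.0])
games_in_genre = np.array([3300, 1700, 1300, 2300, 900])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r.
r = np.corrcoef(percent_over_1mil, games_in_genre)[0, 1]
print("Pearson r:", round(r, 3))
```

Values of `r` near 0 are consistent with no linear relationship, while values near -1 or 1 point to a strong one.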

No correlation between the number of games in a genre and the percentage of games that sell over 1 million is apparent from the plot.
However, the percentages printed earlier show shooting, platforming, and fighting games to be the safest genres, that is, the most likely to sell over 1 million copies. Next, let's do some hypothesis testing with the bootstrapping method to see how often these genres come out on top across a large number of trials. For each genre, we draw a new sample of the same size as the genre's original data by selecting points at random from it. Each selected point is returned to the pool before the next draw, so the same game can appear more than once: we are sampling with replacement. This new sampled dataset is called a bootstrap sample.
First we define our bootstrap replicate function to create each individual replicate. In this case, each replicate is the percentage value from our over_1mil() function.

In [None]:
def bootstrap_rep(series, func):
    """Generate bootstrap replicate of a series"""
    #Draw len(series) values with replacement: this is the bootstrap sample.
    bs_sample=np.random.choice(series, len(series))
    #Applying the summary statistic to the bootstrap sample gives the bootstrap replicate.
    return func(bs_sample)
print(bootstrap_rep(genre_dict['Action']['Global_Sales'], over_1mil)) #it worked!
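
Because each replicate is random, rerunning the cell gives a slightly different number. If reproducible results are wanted, one option is to pass in a seeded NumPy random generator, as in the sketch below. The function name `bootstrap_rep_seeded`, the seed, and the sample values are all assumptions made for illustration, and `over_1mil` is redefined here on plain arrays so the example stands alone.

```python
import numpy as np

def over_1mil(values):
    """Percentage of values at or above 1 (values are in millions)."""
    values = np.asarray(values)
    return 100.0 * np.mean(values >= 1.0)

def bootstrap_rep_seeded(values, func, rng):
    """One bootstrap replicate, drawn with an explicit random generator."""
    values = np.asarray(values)
    # Sample as many points as the data has, with replacement.
    sample = rng.choice(values, size=values.size, replace=True)
    return func(sample)

sales = np.array([0.2, 0.5, 0.9, 1.0, 1.4, 3.2])  # hypothetical sales, in millions

# Two generators started from the same seed produce the same replicate.
rep_a = bootstrap_rep_seeded(sales, over_1mil, np.random.default_rng(42))
rep_b = bootstrap_rep_seeded(sales, over_1mil, np.random.default_rng(42))
print(rep_a, rep_b)
```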

Now that we have made a bootstrap replicate, we will do many trials and see how each genre is distributed with many replicates.

In [None]:
#Create a dictionary holding one empty array per genre to store that genre's 10,000 bs reps.
bs_rep_dict={name: np.empty(10000) for name in genre_names}
for name in genre_dict: #Loop through each genre.
    for i in range(10000): #and generate 10,000 bootstrap replicates.
        bs_rep_dict[name][i]=bootstrap_rep(genre_dict[name]['Global_Sales'], over_1mil)
print(bs_rep_dict) #it worked!
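
The double loop above is easy to follow but slow. A possible alternative is to draw all resampling indices at once and compute every replicate with array operations. This is a sketch of that idea on a made-up sample, not a drop-in replacement: `sales`, the seed, and the sizes are assumptions. For a large genre the index array (replicates times games) may need to be built in chunks to fit in memory.

```python
import numpy as np

rng = np.random.default_rng(0)
sales = rng.lognormal(mean=-1.0, sigma=1.2, size=500)  # hypothetical genre, in millions
n_reps = 2000

# Each row of idx is one bootstrap sample, stored as positions into `sales`.
idx = rng.integers(0, sales.size, size=(n_reps, sales.size))
samples = sales[idx]

# Percentage of each resampled row that reaches 1 million: one replicate per row.
replicates = 100.0 * (samples >= 1.0).mean(axis=1)
print(replicates[:5])
```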

Let's take a look at the distributions for each genre. Then we can move on to ranking the top three safest genres for each replicate iteration and see which genres have the highest likelihood of success in sales.

In [None]:
for name in bs_rep_dict:
    x_bs_rep, y_bs_rep=ecdf(bs_rep_dict[name])
    #Label each genre's points so the legend can be built once, after the loop.
    _=plt.plot(x_bs_rep, y_bs_rep, marker='.', linestyle='none', label=name)
_=plt.xlabel('Percentage of Games with Sales over 1 Million')
_=plt.ylabel('ECDF')
_=plt.legend()
plt.show()

The ECDF provides some insights, but a confidence interval is easier to digest. It describes a range of values that contains the parameter with a specified probability. Let's calculate the 95% confidence interval for each genre: the range between the 2.5th and 97.5th percentiles of its bootstrap replicates, which holds the middle 95% of the replicate percentages.

In [None]:
#The 2.5th and 97.5th percentiles of each genre's replicates bound its 95% interval.
percentile_dict={name: np.percentile(bs_rep_dict[name], [2.5, 97.5]) for name in bs_rep_dict}
for name in percentile_dict:
    print(str(name)+": "+str(percentile_dict[name]))
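
As a rough cross-check on intervals like these, the bootstrap percentile interval for a proportion can be compared with the textbook normal approximation, p ± 1.96·sqrt(p(1−p)/n). The sketch below runs both on simulated data: the genre size of 800 and the 15% hit rate are arbitrary assumptions, not figures from the dataset.

```python
import numpy as np

rng = np.random.default_rng(1)
hits = rng.random(800) < 0.15  # hypothetical genre: True where a game reached 1 million
p_hat = hits.mean()

# Bootstrap percentile interval (in percent), using 5,000 resamples.
idx = rng.integers(0, hits.size, size=(5000, hits.size))
reps = 100.0 * hits[idx].mean(axis=1)
boot_low, boot_high = np.percentile(reps, [2.5, 97.5])

# Normal-approximation interval for the same proportion (in percent).
se = np.sqrt(p_hat * (1.0 - p_hat) / hits.size)
norm_low = 100.0 * (p_hat - 1.96 * se)
norm_high = 100.0 * (p_hat + 1.96 * se)

print("bootstrap:", boot_low, boot_high)
print("normal approx:", norm_low, norm_high)
```

For a sample this large, the two intervals should be close. Bigger gaps tend to appear for small genres or for percentages near 0 or 100.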

From this it is plain to see that Platform and then Shooter games will almost always be the safest two. Shooter has the second-highest lower bound on its 95% interval, and Platform is the only other genre whose upper bound exceeds that lower bound.
Let us confirm this with a hypothesis test. For each index, we will compare the replicates from every genre and see which genre has the highest percentage in that iteration. We will have 10,000 winners, and my hypothesis is that 95% or more of them will be Shooter or Platform.

In [None]:
#First we will set up our winner list:
bs_highest_percentage=[]
#and second and third highest percentages:
second_highest_percentage=[]
third_highest_percentage=[]
for i in range(10000): #Loop through each iteration in our bootstrapping.
    #For this loop we create another dictionary to store the bs replicate percentage value for each iteration in our
    #10,000 trials. Then we pull the key (genre) from the max value in the iteration. Then we delete the key:value pair
    #with that max value and obtain the key from the new max value, giving us the key corresponding to the second highest
    #percent value. We repeat this process to obtain the third highest value.
    bs_max={name: bs_rep_dict[name][i] for name in genre_names}
    bs_highest_percentage.append(max(bs_max, key=bs_max.get))
    bs_max.pop(bs_highest_percentage[i])
    second_highest_percentage.append(max(bs_max, key=bs_max.get))
    bs_max.pop(second_highest_percentage[i])
    third_highest_percentage.append(max(bs_max, key=bs_max.get))
print(set(bs_highest_percentage))
print(set(second_highest_percentage))
print(set(third_highest_percentage))

Through 10,000 trials, Platform and Shooter were the only genres ever to have the highest percentage of games sold over 1 million.

Below is a different method I came up with to get the top 3 positions in our 10,000 trials:

"""for i in range(10000): #Loop through each iteration in our bootstrapping
    bs_max={name: bs_rep_dict[name][i] for name in genre_names}
    bs_max_list=list(bs_max.values())
    bs_max_list.sort(reverse=True) #Percentages for this trial, highest first
    for genre in bs_max:
        #Append the genre whose value matches each of the top three percentages.
        #Note: genres tied on a value would all be appended to the same list.
        if bs_max[genre]==bs_max_list[0]:
            bs_highest_percentage.append(genre)
        elif bs_max[genre]==bs_max_list[1]:
            second_highest_percentage.append(genre)
        elif bs_max[genre]==bs_max_list[2]:
            third_highest_percentage.append(genre)"""
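
Yet another way to get the same three placings is to sort the genre names by their replicate value in each trial and take the first three. The sketch below is self-contained and runs on simulated replicates: `bs_rep_toy`, its four genres, and the normal distributions behind it are hypothetical stand-ins for `bs_rep_dict`.

```python
import numpy as np

rng = np.random.default_rng(2)
n_trials = 1000

# Hypothetical replicate arrays standing in for bs_rep_dict (percentages).
bs_rep_toy = {
    "Platform": rng.normal(15.0, 1.0, n_trials),
    "Shooter": rng.normal(13.0, 1.0, n_trials),
    "Fighting": rng.normal(10.0, 1.0, n_trials),
    "Puzzle": rng.normal(6.0, 1.0, n_trials),
}

first, second, third = [], [], []
for i in range(n_trials):
    # Replicate value of every genre in this trial.
    trial = {name: reps[i] for name, reps in bs_rep_toy.items()}
    # Genre names ordered from highest to lowest percentage.
    ranked = sorted(trial, key=trial.get, reverse=True)
    first.append(ranked[0])
    second.append(ranked[1])
    third.append(ranked[2])

print(first.count("Platform") / n_trials)
```

Sorting once per trial avoids popping keys from the dictionary, and a tie can never place one genre in two positions.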
            
Now we will count the occurrences of each genre in each placing.

In [None]:
df_dict={"1st":bs_highest_percentage, "2nd":second_highest_percentage, "3rd":third_highest_percentage}
percentage_placing_df=pd.DataFrame(df_dict)
percentage_placing_df.info() #info() prints its summary itself and returns None.
#Now let's see how many of each entry was generated in each column.
first = percentage_placing_df['1st'].value_counts()
second = percentage_placing_df['2nd'].value_counts()
third = percentage_placing_df['3rd'].value_counts()
print(first)
print(second)
print(third)

As hypothesized, the first placing in our trials went to Platform games ~95% of the time. Shooter games took the top spot the remaining ~5% of the time and also dominate the second-place ranking. No other genre took first place even once in our 10,000 trials! Fighting and Racing games are fairly evenly represented in the third-highest rank, and each had a small number of second-place rankings.
Let's plot and get a better visual feel for the results.

In [None]:
sns.barplot(x=first.index, y=first.values) #Recent seaborn versions require x and y as keyword arguments.
plt.title('First Placing of Highest Percentage of Games Sold Over 1 Million '+
          'in 10,000 Trials')
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('Genre', fontsize=12)
plt.show()
sns.barplot(x=second.index, y=second.values)
plt.title('Second Placing of Highest Percentage of Games Sold Over 1 Million '+
          'in 10,000 Trials')
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('Genre', fontsize=12) 
plt.show()
sns.barplot(x=third.index, y=third.values)
plt.title('Third Placing of Highest Percentage of Games Sold Over 1 Million '+
          'in 10,000 Trials')
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('Genre', fontsize=12)
plt.show()  

The plots reconfirm Platform games as the obvious choice for the safest game to develop (if the goal is to sell over 1 million copies). Shooter games are a clear second-safest choice, with Fighting games and then, close behind, Racing games as the next two safest choices.
This concludes our analysis of the safest games to produce by genre within the vgsales csv file.
