### EDA: Exploratory data analysis
___
Now that our data has been prepared in the preprocessing notebook, we will continue toward the goal of making global video game advertising suggestions.

Detailed preprocessing steps are explained in the Readme and the Preprocessing notebook. Missing values were examined, duplicates were removed, new columns were created, and spelling conventions were standardized.

In this notebook the processed data will be explored, and then some analysis will be conducted. 

Topics explored:
- How has video game production changed over time?
- Determine the relevant time period for available data
- How long are platform lifespans?
- Look at sales over time
- What are the popular platforms, genres, and ratings?
- Are trends different in different regions?
- Do critic reviews matter?
- Are differences in sales means statistically significant, or can they be attributed to chance?


In [261]:
# load necessary packages for data exploration
import pandas as pd
import plotly.express as px
from scipy import stats as st

In [262]:
# personalize/modify pandas

# used max rows None, but it makes scrolling a pain.
pd.set_option('display.max_rows', 32)
pd.set_option('display.max_columns', 32)
pd.set_option('display.precision', 4)
pd.set_option('display.max_colwidth', 132)

In [263]:
# load processed data
game = pd.read_csv('../processed_games.csv')

### Data Exploration
- The goal is to give advice on video game forecasting.
- Is all of the data relevant to predicting trends?
- How many games were released in different years?
#### Production timeline

In [264]:
# Look at how many games were released in different years. Is the data for every period significant?
pivot = game.pivot_table(index='year_of_release',
                         values='name', aggfunc='count')
pivot

Unnamed: 0_level_0,name
year_of_release,Unnamed: 1_level_1
1980.0,9
1981.0,46
1982.0,36
1983.0,17
1984.0,14
...,...
2012.0,651
2013.0,544
2014.0,581
2015.0,606


thoughts
____
Video game creation appears to go through stages, each lasting about a decade <br>
- 1980s "early years": similar game creation rates throughout, ~16 games/year<br>
- 1990s "adoption phase": game production steadily increases from 16 in 1990 to 338 in 1999<br>
- 2000s "peak": production continues to increase, peaking with 1426 games in 2009<br>
- 2010s "focus": production abruptly declines to about 600 games per year <br>

Data from 2012-2016 appears more relevant based on this information. Something changed in the industry to create such an immediate drop in production.
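
The decade-by-decade reading above can be checked by averaging yearly release counts within each decade. The sketch below uses a small made-up frame (`toy` and its values are hypothetical, not the real dataset) and assumes the same `name` and `year_of_release` columns used in this notebook.

```python
import pandas as pd

# Hypothetical toy data: one row per released game (not the real dataset).
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e", "f"],
    "year_of_release": [1985.0, 1994.0, 1994.0, 1999.0, 2008.0, 2014.0],
})

# Floor each year to its decade, e.g. 1994 -> 1990.
toy["decade"] = (toy["year_of_release"] // 10 * 10).astype(int)

# Count releases per year, then average those yearly counts within each decade.
# Years with no releases at all are not counted in the average.
per_year = toy.groupby(["decade", "year_of_release"])["name"].count()
games_per_year_by_decade = per_year.groupby(level="decade").mean()
print(games_per_year_by_decade)
```

On the real data, the same two steps applied to `game` would give one games-per-year figure per decade to compare against the estimates above.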

In [265]:
# going deeper into the significant range
game2 = game.pivot_table(index=['brand', 'mobile', 'platform', 'year_of_release'],
                         values=['total_sales', 'na_sales', 'eu_sales', 'jp_sales', 'other_sales'], aggfunc='sum')
game2

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,eu_sales,jp_sales,na_sales,other_sales,total_sales
brand,mobile,platform,year_of_release,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
atari,0,2600,1980.0,0.67,0.00,10.59,0.12,11.38
atari,0,2600,1981.0,1.96,0.00,33.40,0.32,35.68
atari,0,2600,1982.0,1.65,0.00,26.92,0.31,28.88
atari,0,2600,1983.0,0.34,0.00,5.44,0.06,5.84
atari,0,2600,1984.0,0.01,0.00,0.26,0.00,0.27
...,...,...,...,...,...,...,...,...
sony,1,playstation_vita,2012.0,5.26,2.45,5.94,2.54,16.19
sony,1,playstation_vita,2013.0,2.57,4.05,2.52,1.45,10.59
sony,1,playstation_vita,2014.0,2.45,6.13,1.98,1.34,11.90
sony,1,playstation_vita,2015.0,0.69,4.85,0.39,0.32,6.25


##### Is the data for every period important?

No. It appears that when production for a platform declines, the next platform is about to be released. A platform lifespan of 7-10 years is typical, though it varies from company to company. Platforms that stop getting new games go away.

Keeping only 2012 onward, and focusing on these platforms:
- computer
- xbox one
- 3ds and wii_u, which began to drop (new nintendo mobile coming?)
- playstation 4
- playstation vita, where production is decreasing


thoughts
___
- Are there tariffs on video games?
- xbox sales are mostly in north america (na_sales)
- japan sales are mostly nintendo and playstation
- europe sales: other than the wii and ds, europe sales are only playstations
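
One way to put numbers on these regional impressions is to convert each brand's regional sales into shares of that brand's total. This is a minimal sketch on made-up values (the `toy` frame is hypothetical); it assumes the `brand`, `na_sales`, `eu_sales`, `jp_sales` and `other_sales` columns from the processed data.

```python
import pandas as pd

# Hypothetical toy numbers, not taken from the real dataset.
toy = pd.DataFrame({
    "brand": ["microsoft", "microsoft", "sony", "nintendo"],
    "na_sales": [3.0, 1.0, 2.0, 1.0],
    "eu_sales": [1.0, 1.0, 2.0, 1.0],
    "jp_sales": [0.0, 0.0, 1.0, 2.0],
    "other_sales": [1.0, 1.0, 0.0, 0.0],
})

region_cols = ["na_sales", "eu_sales", "jp_sales", "other_sales"]

# Sum each region per brand, then divide every row by its own total
# so that each row becomes a set of shares adding up to 1.
totals = toy.groupby("brand")[region_cols].sum()
shares = totals.div(totals.sum(axis=1), axis=0)
print(shares.round(2))
```

A brand whose `na_sales` share is well above the others would support the "mostly north america" reading; the same table would show how concentrated japan and europe sales are.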


___
#### Platform comparisons
___
- How do sales compare over time?
- Find platforms that used to be popular but now have zero sales.
- How long does it generally take for new platforms to appear and old ones to fade?
- How are platform sales trending?

In [266]:
# look at how sales varied from platform to platform

# convert pivot table to a list of dictionaries
game2_list = game2.reset_index().to_dict(orient='records')

# Create a bar chart using px.bar
fig = px.bar(
    game2_list,
    x='year_of_release',
    y='total_sales',
    color='platform',
    title='Total Sales by Year and Platform',
    opacity=0.7,
    barmode='group',  # group mode: bars for a year are offset side by side rather than stacked
)
fig.update_traces(width=1)
# Update layout (optional)
fig.update_layout(
    xaxis_title='Year of Release',
    yaxis_title='Total Sales in millions (USD)',
    legend_title='Platform',
    legend_xanchor='right',  # Move legend to the right side
    legend_x=1.2  # Adjust legend position further to the right
)

fig.show()

thoughts
___
Video game sales increased greatly in the mid 1990s. Each time period has a clear champion: playstation ruled 1995-2005, nintendo 2005-2010, then xbox and playstation again.

In [267]:
# Choose the platforms with the greatest total sales and build a distribution based on data for each year.
game.groupby(['platform'])['total_sales'].sum().sort_values(ascending=False)

platform
playstation_2                          1255.77
xbox_360                                969.86
playstation_3                           935.92
wii                                     907.51
ds                                      806.12
playstation                             730.86
gameboy_advanced                        317.85
playstation_4                           314.14
playstation_portable                    294.05
computer                                259.23
3ds                                     259.00
xbox                                    257.74
gameboy                                 255.46
nintendo_entertainment_system           251.05
nintendo_64                             218.68
super_nintendo_entertainment_system     200.04
game_cube                               198.93
xbox_one                                159.32
2600                                     96.98
wii_u                                    82.19
playstation_vita                         54.07
satu

top 6 platforms with the most sales are:<br>
playstation_2                          1255.77<br>
xbox_360                                969.86<br>
playstation_3                           935.92<br>
wii                                     907.51<br>
ds                                      806.12<br>
playstation                             730.86<br>



In [268]:
# yearly sales for the six best-selling platforms
best_sellers = ['playstation_2', 'xbox_360',
                'playstation_3', 'wii', 'ds', 'playstation']

game3 = game[game['platform'].isin(best_sellers)].groupby(
    ['platform', 'year_of_release'])['total_sales'].sum()

game3

platform  year_of_release
ds        1985.0               0.02
          2004.0              17.27
          2005.0             130.14
          2006.0             119.81
          2007.0             146.94
                              ...  
xbox_360  2012.0              98.18
          2013.0              88.58
          2014.0              34.74
          2015.0              11.96
          2016.0               1.52
Name: total_sales, Length: 67, dtype: float64

In [269]:
# what is going on with this 1985 entry?
game[(game['platform'] == 'ds') & (game['year_of_release'] == 1985)]

Unnamed: 0,name,platform,year_of_release,genre,na_sales,eu_sales,jp_sales,other_sales,critic_score,user_score,rating,brand,mobile,total_sales
15951,strongest tokyo university shogi ds,ds,1985.0,action,0.0,0.0,0.02,0.0,,,na,nintendo,1,0.02


In [270]:
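# from the internet, this game came out in 2007, so let's fix that
# select the row by condition and column label rather than by position
game.loc[(game['platform'] == 'ds') & (game['year_of_release'] == 1985),
         'year_of_release'] = 2007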


Find platforms that used to be popular but now have zero sales. 

In [307]:
# same distribution with only the desired platforms

# filter the dataframe for only best selling platforms
platforms_to_include = ['playstation_2', 'xbox_360',
                        'playstation_3', 'wii', 'ds', 'playstation']

# filtered dataframe
filtered_game = game[game['platform'].isin(platforms_to_include)]

fig = px.bar(
    filtered_game,
    x="year_of_release",
    y="total_sales",
    color="platform",
    title="Distribution of video game sales by year and platform",
    opacity=0.7,
    labels={
        "year_of_release": "Year: ",
        "total_sales": "Total sales in millions: $"
    }
)
fig.update_traces(width=1,
                  marker_line_width=0)
fig.update_layout(
    title_text="Distribution of Sales by Year and Platform",
    title_x=0.5,  # Center the title
    xaxis_title="Year",
    yaxis_title="Total sales in millions (USD)"
)

fig.show()

Platforms that used to be popular but now have zero sales: playstation, playstation 2, ds, wii, and (almost) playstation 3 and xbox 360 are all great examples.

In [272]:
# how long does it take for new platforms to appear and old ones to fade?

# number of platforms
platform_count = len(game.groupby(['platform'])['year_of_release'].max())

# average length of platform life
sum(game.groupby(['platform'])['year_of_release'].max() -
    game.groupby(['platform'])['year_of_release'].min()) / platform_count

7.0

The typical lifespan of a video game platform is 7 years, measured from its first release year to its last release year in the data.

The relevant data period is 2012-2016. If a platform's production has decreased, it is going away. The 2016 sales numbers are low, which makes me believe this data was collected in mid 2016.
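
The lifespan figure can also be inspected per platform instead of only as one average. The sketch below is an illustration on made-up data (`toy`, `first_year` and `last_year` are hypothetical names); it assumes the same `platform` and `year_of_release` columns as the notebook.

```python
import pandas as pd

# Hypothetical toy data: release years for three invented platforms.
toy = pd.DataFrame({
    "platform": ["alpha", "alpha", "alpha", "beta", "beta", "gamma"],
    "year_of_release": [2000, 2004, 2006, 2010, 2013, 2015],
})

# First and last release year per platform, then the gap between them.
span = toy.groupby("platform")["year_of_release"].agg(
    first_year="min", last_year="max")
span["lifespan"] = span["last_year"] - span["first_year"]

print(span)
print("mean:", span["lifespan"].mean(), "median:", span["lifespan"].median())
```

Platforms that only appear in a single year, or that are still on sale when the data ends, show up with short lifespans, so looking at the per-platform table alongside the mean and median helps judge how far such cases pull the average down.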

In [273]:
# filter out irrelevant data

new_games = game[game['year_of_release'].isin([2012, 2013, 2014, 2015, 2016])]

In [274]:
# which platforms are leading in sales by year
new_games.groupby(['platform', 'year_of_release'])['total_sales'].sum()

platform  year_of_release
3ds       2012.0             51.36
          2013.0             56.57
          2014.0             43.76
          2015.0             27.78
          2016.0             15.14
                             ...  
xbox_360  2016.0              1.52
xbox_one  2013.0             18.96
          2014.0             54.07
          2015.0             60.14
          2016.0             26.15
Name: total_sales, Length: 49, dtype: float64

In [275]:
# which platforms are leading in sales
new_games.groupby(['platform'])['total_sales'].sum()

platform
3ds                     194.61
computer                 62.65
ds                       12.55
playstation_3           286.23
playstation_4           314.14
playstation_portable     11.19
playstation_vita         49.18
wii                      35.37
wii_u                    82.19
xbox_360                234.98
xbox_one                159.32
Name: total_sales, dtype: float64

##### platform summary
sales leaders from 2012-2016 are:
1) playstation 4
2) playstation 3
3) xbox 360
4) 3ds
5) xbox one

shrinking in sales:
- 3ds
- computer
- playstation 3
- playstation portable (discontinued)
- playstation vita
- xbox 360
- wii
- wii_u

increasing in sales:
- xbox one
- playstation 4
___
##### Popular platform boxplots
___


In [276]:

# select several profitable platforms and box plot game sales
# Build a box plot for the global sales of all games, broken down by platform

# filter for profitable platforms
prof_games = ['xbox_one', 'playstation_4', '3ds']
box_games = new_games[new_games['platform'].isin(prof_games)]

fig = px.box(box_games,
             x="platform",
             y="total_sales",
             title="Game sales distribution for popular platforms",
             width=600,
             height=1000,
             labels={'name': 'Name: ',
                     'total_sales': 'Global sales: '},
             )
fig.update_layout(
    xaxis_title="Platform",
    yaxis_title="Total sales in millions (USD)"
)

fig.show()

thoughts
___
The medians look similar; the 75th percentile (third quartile) suggests that playstation 4 and xbox one games are more likely to have higher sales. Let's see if that is real.

In [277]:
# box_games is 2012+ for popular platforms

# filter for specific platforms
play = box_games[box_games['platform'] == 'playstation_4']
ds = box_games[box_games['platform'] == '3ds']
xbox = box_games[box_games['platform'] == 'xbox_one']

# store the average values
x_mean = xbox['total_sales'].mean()
p_mean = play['total_sales'].mean()
d_mean = ds['total_sales'].mean()

# display the averages
print(f"Average game sales for xbox one: {x_mean}")
print(f"Average game sales for playstation 4: {p_mean}")
print(f"Average game sales for 3ds: {d_mean}")

Average game sales for xbox one: 0.6450202429149797
Average game sales for playstation 4: 0.8013775510204081
Average game sales for 3ds: 0.4914393939393939


In [278]:
# box plot of global sales for all 2012-2016 games, broken down by brand

fig = px.box(new_games,
             x="brand",
             y="total_sales",
             title="Game sales distribution for popular brands",
             width=600,
             height=1000,
             labels={'name': 'Name: ',
                     'total_sales': 'Global sales: '},
             )
fig.update_layout(
    xaxis_title="Brand",
    yaxis_title="Total sales in millions (USD)"
)

fig.show()

In [279]:
# new_games is the dataframe with all entries from 2012 onward

# filter for specific brands
sony = new_games[new_games['brand'] == 'sony']
nintendo = new_games[new_games['brand'] == 'nintendo']
micro = new_games[new_games['brand'] == 'microsoft']

# store the average values
sony_mean = sony['total_sales'].mean()
nintendo_mean = nintendo['total_sales'].mean()
micro_mean = micro['total_sales'].mean()

# display the averages
print(f"Average game sales for microsoft: {micro_mean}")
print(f"Average game sales for sony: {sony_mean}")
print(f"Average game sales for nintendo: {nintendo_mean}")

print(f"Average game sales for xbox one: {x_mean}")
print(f"Average game sales for playstation 4: {p_mean}")
print(f"Average game sales for 3ds: {d_mean}")

Average game sales for microsoft: 0.732899628252788
Average game sales for sony: 0.45009536784741144
Average game sales for nintendo: 0.5170700636942676
Average game sales for xbox one: 0.6450202429149797
Average game sales for playstation 4: 0.8013775510204081
Average game sales for 3ds: 0.4914393939393939


#### Hypothesis testing
___
Null hypothesis: There is no difference between the population means<br>
Alternative hypothesis: The population means are different

We will use Levene's test to check whether the two samples have equal variances, which determines whether the t-test should assume equal variances (`equal_var`).

A t-test for independent samples is used because the sales samples are independent of one another.
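
When Levene's test indicates unequal variances, `st.ttest_ind(..., equal_var=False)` performs Welch's t-test. As a worked illustration of what that statistic is, the sketch below computes it by hand with the standard library. The helper `welch_t` and the two samples are hypothetical and only meant to show the formula, not to replace the SciPy call.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    # squared standard error of each sample mean (sample variance / n)
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, dof


# Hypothetical samples with clearly different spreads.
sample_a = [1.0, 2.0, 3.0, 4.0, 5.0]
sample_b = [2.0, 4.0, 6.0, 8.0, 10.0]

t_stat, dof = welch_t(sample_a, sample_b)
print(f"t = {t_stat:.4f}, degrees of freedom = {dof:.3f}")
```

The p-value then comes from the t distribution with `dof` degrees of freedom, which is the part SciPy handles in this notebook.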

In [280]:
# function for hypothesis testing

def t_test(s1, s2, alpha=0.05):
    """Compare the means of two independent samples and print the decision."""
    # Levene's test: the null hypothesis is that the variances are equal
    p_value_levene = st.levene(s1, s2).pvalue

    # a small p-value means the variances differ, so do not assume equal variances
    equal_var = not p_value_levene < alpha

    # p-value for the difference in means (Welch's t-test when equal_var is False)
    pvalue = st.ttest_ind(s1, s2, equal_var=equal_var).pvalue

    # decide based on the p-value for the means
    if pvalue < alpha:
        print(
            f'Reject Null hypothesis, p-value: {pvalue} The sample means are sufficiently different')
    else:
        print(
            f'We cannot reject the Null hypothesis based on this information. pvalue: {pvalue}')

In [281]:
# compare the averages for popular platforms to determine if the population means are significantly different
# based on sales from 2012 onward

print("Comparing sample means from xbox_one and playstation_4")
t_test(xbox['total_sales'], play['total_sales'])
print()
print("Comparing sample means from xbox_one and 3ds")
t_test(xbox['total_sales'], ds['total_sales'])
print()
print("Comparing sample means from playstation 4 and 3ds")
t_test(play['total_sales'], ds['total_sales'])

Comparing sample means from xbox_one and playstation_4
We cannot reject the Null hypothesis based on this information. pvalue: 0.17450241259669863

Comparing sample means from xbox_one and 3ds
We cannot reject the Null hypothesis based on this information. pvalue: 0.1344827211541429

Comparing sample means from playstation 4 and 3ds
Reject Null hypothesis, p-value: 0.003907603379878335 The sample means are sufficiently different


In [282]:
# compare the averages for popular platforms to determine if the population means are significantly different
# based on sales from 2012 onward

# comparing brands, not just the most recent popular platforms
print("Comparing sample means from microsoft and sony")
t_test(micro['total_sales'], sony['total_sales'])
print()
print("Comparing sample means from microsoft and nintendo")
t_test(micro['total_sales'], nintendo['total_sales'])
print()
print("Comparing sample means from sony and nintendo")
t_test(sony['total_sales'], nintendo['total_sales'])

Comparing sample means from microsoft and sony
Reject Null hypothesis, p-value: 5.851971019415867e-05 The sample means are sufficiently different

Comparing sample means from microsoft and nintendo
Reject Null hypothesis, p-value: 0.007857266772924072 The sample means are sufficiently different

Comparing sample means from sony and nintendo
We cannot reject the Null hypothesis based on this information. pvalue: 0.26090494512315315


In [283]:
# let's compare style of platform sales, mobile or not mobile
# from the data nintendo appears to be pivoting toward all mobile gaming. Sony has begun discontinuing mobile and microsoft doesn't make mobile games

mobile = new_games[new_games['mobile'] == 1]
not_mobile = new_games[new_games['mobile'] == 0]

In [284]:
# compare the sales averages by platform type (mobile vs. not mobile) to determine if the population means are significantly different
# based on sales from 2012 onward
print("Comparing sample means from mobile and not mobile games")
t_test(mobile['total_sales'], not_mobile['total_sales'])
print(f"mean of mobile sales: {mobile['total_sales'].mean()}")
print(f"Mean of non-mobile sales: {not_mobile['total_sales'].mean()}")

Comparing sample means from mobile and not mobile games
Reject Null hypothesis, p-value: 5.2617718098736514e-14 The sample means are sufficiently different
mean of mobile sales: 0.3020034542314336
Mean of non-mobile sales: 0.6330764774044033


We can conclude that there is a difference in the population means of playstation 4 sales and 3ds sales. <br>
Current suggestion: advertise playstation 4 games over 3ds games.
<br>
When comparing brands (sony, nintendo, microsoft):<br>
We can conclude that microsoft's mean sales differ significantly from both sony's and nintendo's. Microsoft data includes xbox 360 and xbox one. Sony includes playstation 3, playstation 4, playstation vita and playstation portable. Nintendo includes 3ds, wii, wii_u and ds.<br>
<br>
The most significant difference was found between mobile and non-mobile gaming. My suggestion is to advertise non-mobile gaming.
<br>
Summarized so far: my suggestion is to advertise exclusively non-mobile games, for xbox one and playstation 4.

In [285]:
# how do critic and user reviews affect sales, on a single platform

# create the critic reviews plot
fig1 = px.scatter(xbox, x="critic_score",
                  y="total_sales", trendline='ols', color='platform', color_discrete_sequence=["red"])

# create the user reviews plot
fig2 = px.scatter(xbox, x="user_score", y="total_sales",
                  color="platform", color_discrete_sequence=["blue"], trendline='ols')


# combine the two plots
fig1.add_traces(fig2.data)

# add titles
fig1.update_layout(title="Xbox one game sales against critic and user reviews",
                   xaxis_title="Review score",
                   yaxis_title="Total sales in millions (USD)")


fig1.show()

In [286]:
print("R^2 for critic reviews is 0.173888, correlation is 0.417")
print("R^2 for user reviews is 0.004751, correlation is 0.069")
print()
print("Neither review system would be considered a great fitting trend line,\n however the user reviews have essentially zero correlation, suggesting little to no effect on sales")
print('The critic score trend line predicts negative sales below a score of 50, let us explore that')

R^2 for critic reviews is 0.173888, correlation is 0.417
R^2 for user reviews is 0.004751, correlation is 0.069

Neither review system would be considered a great fitting trend line,
 however the user reviews have essentially zero correlation, suggesting little to no effect on sales
The critic score trend line predicts negative sales below a score of 50, let us explore that


In [287]:
# advanced look at critic scores
# filter for scores of 50 and above
above_50 = xbox[xbox['critic_score'] >= 50]

# create the critic reviews plot
fig1 = px.scatter(above_50, x="critic_score",
                  y="total_sales", trendline='ols', color='platform', color_discrete_sequence=["red"])

# add titles
fig1.update_layout(title="Xbox one game sales against higher critic scores (above 50)",
                   xaxis_title="Review score",
                   yaxis_title="Total sales in millions (USD)")

fig1.show()

The correlation did increase, but only to 0.43. The line is still not a good fit.
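
For a straight-line trend fitted by least squares, R^2 is simply the square of the correlation coefficient, which is why an R^2 of 0.173888 goes with a correlation of about 0.417. The sketch below checks that relationship on made-up scores and sales (the arrays are hypothetical); it uses NumPy rather than the plotly trendline, so it is only an illustration of the arithmetic.

```python
import numpy as np

# Hypothetical scores and sales, for illustration only.
critic_score = np.array([55.0, 60.0, 70.0, 80.0, 90.0])
total_sales = np.array([0.2, 0.1, 0.6, 0.5, 1.4])

# Pearson correlation coefficient r
r = np.corrcoef(critic_score, total_sales)[0, 1]

# Fit a straight line by least squares and compute R^2 from the residuals
slope, intercept = np.polyfit(critic_score, total_sales, 1)
predicted = slope * critic_score + intercept
ss_res = ((total_sales - predicted) ** 2).sum()
ss_tot = ((total_sales - total_sales.mean()) ** 2).sum()
r_squared = 1 - ss_res / ss_tot

print(f"r = {r:.3f}, r^2 = {r ** 2:.3f}, R^2 = {r_squared:.3f}")
```

Read this way, a correlation of 0.43 means critic scores account for under a fifth of the variation in sales, which matches the conclusion that the line is a weak fit.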

In [288]:
# compare sales of the same game across different platforms

# find a game on multiple platforms
new_games['name'].value_counts()

name
lego marvel super heroes                                      9
fifa 14                                                       9
lego jurassic world                                           8
the lego movie videogame                                      8
angry birds star wars                                         8
                                                             ..
final fantasy iii                                             1
yakuza                                                        1
armored core: verdict day                                     1
2 in 1 combo pack: sonic heroes / super monkey ball deluxe    1
haitaka no psychedelica                                       1
Name: count, Length: 1671, dtype: int64

In [289]:
# bar chart of fifa 15 sales across platforms
fifa_15 = new_games[new_games['name'] == 'fifa 15']

fig = px.bar(fifa_15,
             x="platform",
             y="total_sales",
             title="Fifa 15 sales across different platforms",
             width=600,
             height=1000,
             labels={'name': 'Name: ',
                     'total_sales': 'Global sales: '},
             )
fig.update_layout(
    xaxis_title="Platform",
    yaxis_title="Total sales in millions (USD)"
)

fig.show()

Fifa 15 is more popular on the playstation platforms<br>
$10.36 million for playstation 3/4<br>
$5.1 million for xbox 360/one


In [290]:
# Take a look at the general distribution of games by genre
new_games.groupby('genre').agg(
    {'total_sales': ['count', 'sum', 'mean', 'max', 'min']})

Unnamed: 0_level_0,total_sales,total_sales,total_sales,total_sales,total_sales
Unnamed: 0_level_1,count,sum,mean,max,min
genre,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
action,1031,441.12,0.4279,21.05,0.01
adventure,302,29.43,0.0975,1.66,0.01
fighting,109,44.49,0.4082,7.55,0.01
misc,192,85.04,0.4429,9.18,0.01
platform,85,61.0,0.7176,9.9,0.01
puzzle,28,4.89,0.1746,1.19,0.01
racing,114,51.94,0.4556,7.09,0.01
role-playing,370,192.8,0.5211,14.6,0.01
shooter,235,304.73,1.2967,14.63,0.01
simulation,80,35.12,0.439,9.17,0.01


The most profitable genre is 'shooter', averaging $1.2967 million per game. Second is 'sports' at $0.704 million per game.

The genres with the biggest wins (highest sales game):<br>
1) action 21.05mil
2) shooter 14.63mil
3) role-playing 14.60mil
4) misc 9.18mil
5) sports 8.58mil

Genres to avoid advertising:
- adventure: 302 titles, average sales 0.0975mil, and best selling game is 1.66mil
- puzzle: 28 titles, average sales 0.1746mil, and best selling game is 1.19mil
- strategy: 56 titles, average sales 0.18mil, and best selling game is 1.67mil
- simulation: 80 titles, average sales 0.439mil, and best selling game is 9.17mil

Besides the adventure genre, the more titles a genre has, the more profitable it is.

In [291]:
# there is one more feature I would like to look at, rating

# same agg from earlier
new_games.groupby('rating').agg(
    {'total_sales': ['count', 'sum', 'mean', 'max', 'min']})

Unnamed: 0_level_0,total_sales,total_sales,total_sales,total_sales,total_sales
Unnamed: 0_level_1,count,sum,mean,max,min
rating,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
everyone,394,284.61,0.7224,9.9,0.01
everyone_over_ten,306,155.49,0.5081,6.76,0.01
mature,498,510.11,1.0243,21.05,0.01
na,1275,330.82,0.2595,14.63,0.01
teen,411,161.38,0.3927,5.64,0.01


The best rating is mature, by a lot actually (2012-2016 data).

In [292]:
game.groupby('rating').agg(
    {'total_sales': ['count', 'sum', 'mean', 'max', 'min']})

Unnamed: 0_level_0,total_sales,total_sales,total_sales,total_sales,total_sales
Unnamed: 0_level_1,count,sum,mean,max,min
rating,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
adults only,1,1.96,1.96,1.96,1.96
early childhood,8,1.75,0.2187,0.45,0.04
everyone,3992,2437.28,0.6105,82.54,0.01
everyone_over_ten,1419,654.43,0.4612,10.12,0.01
mature,1563,1473.79,0.9429,21.05,0.01
na,6764,2846.91,0.4209,40.24,0.0
rating_pending,3,0.09,0.03,0.04,0.01
teen,2959,1491.5,0.5041,12.84,0.01


With all-time data, mature is still the place to be.

So far I would say:
- mature rating
- available on playstation or xbox
- shooter, role-playing or action
- non-mobile

In [None]:
# my choice filters
choice = new_games[(new_games['rating'] == 'mature') &
                   (new_games['platform'].isin(['xbox_one', 'playstation_4'])) &
                   (new_games['genre'].isin(['shooter', 'action', 'role-playing'])) &
                   (new_games['mobile'] == 0)]

print(f"Average of all games: {new_games['total_sales'].mean():.2f} million")
print(f"Average of choice games: {choice['total_sales'].mean():.2f} million")

Average of all games: 0.50 million
Average of choice games: 1.45 million


Might not be the best combination, but it is almost 3x average sales
___
#### Top 5 selling platforms by region
___

In [294]:
# find the top 5 sales platforms by region

# north america
top_sellers_na = new_games.groupby(
    'platform')['na_sales'].sum().nlargest(5).index

# filter for top 5 platforms
df_top_sellers_na = new_games[new_games['platform'].isin(top_sellers_na)].copy()

# add column to mark region
df_top_sellers_na['region'] = 'north_america'

# japan
top_sellers_jp = new_games.groupby(
    'platform')['jp_sales'].sum().nlargest(5).index

# filter for top 5 platforms
df_top_sellers_jp = new_games[new_games['platform'].isin(top_sellers_jp)].copy()

# add column to mark region
df_top_sellers_jp['region'] = 'japan'

# europe
top_sellers_eu = new_games.groupby(
    'platform')['eu_sales'].sum().nlargest(5).index

# filter for top 5 platforms
df_top_sellers_eu = new_games[new_games['platform'].isin(top_sellers_eu)].copy()

# add column to mark region
df_top_sellers_eu['region'] = 'europe'

# other
top_sellers_oth = new_games.groupby(
    'platform')['other_sales'].sum().nlargest(5).index

# filter for top 5 platforms
df_top_sellers_oth = new_games[new_games['platform'].isin(top_sellers_oth)].copy()

# add column to mark region
df_top_sellers_oth['region'] = 'other'

# merge them all together
region_sales = pd.concat([df_top_sellers_oth, df_top_sellers_eu,
                         df_top_sellers_jp, df_top_sellers_na], ignore_index=True)




In [295]:
# create individual bar charts
fig_eu = px.bar(df_top_sellers_eu,
                x='eu_sales',
                y='region',
                orientation='h',
                color='platform')

fig_jp = px.bar(df_top_sellers_jp,
                x='jp_sales',
                y='region',
                orientation='h',
                color='platform')

fig_na = px.bar(df_top_sellers_na,
                x='na_sales',
                y='region',
                orientation='h',
                color='platform')

fig_oth = px.bar(df_top_sellers_oth,
                 x='other_sales',
                 y='region',
                 orientation='h',
                 color='platform')


# merge charts together
additional_figures = [fig_jp, fig_na, fig_oth]
for fig in additional_figures:
    fig_eu.add_traces(fig.data)

fig_eu.update_traces(marker_line_width=0)

# add titles
fig_eu.update_layout(title="Top 5 selling platforms by region",
                     xaxis_title="Total sales in millions (USD)",
                     yaxis_title="Region")
fig_eu.show()

North America:<br>
Major video game market<br>
#1 in xbox sales<br>

Japan:<br>
top 5 platforms are all japanese companies

Europe:<br>
More reflective of north america sales
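
The per-region blocks in this section repeat the same three steps: find the top groups, filter, and label the region. A possible way to fold them into one helper is sketched below. The function `top_n_by_region` and the toy frame are hypothetical and not part of the notebook; the sketch uses `.assign`, which returns a new frame and so avoids the chained-assignment warning.

```python
import pandas as pd


def top_n_by_region(df, group_col, region_cols, n=5):
    """Keep rows belonging to the top-n groups of each region, tagged by region.

    region_cols maps a region label to the name of its sales column.
    """
    frames = []
    for region, sales_col in region_cols.items():
        # groups with the largest summed sales in this region
        top = df.groupby(group_col)[sales_col].sum().nlargest(n).index
        # .assign returns a new frame, so the original data is left untouched
        frames.append(df[df[group_col].isin(top)].assign(region=region))
    return pd.concat(frames, ignore_index=True)


# Hypothetical toy data with two regions and three platforms.
toy = pd.DataFrame({
    "platform": ["a", "b", "c", "a"],
    "na_sales": [5.0, 1.0, 3.0, 2.0],
    "jp_sales": [0.0, 4.0, 1.0, 0.5],
})

result = top_n_by_region(
    toy, "platform", {"north_america": "na_sales", "japan": "jp_sales"}, n=2)
print(result)
```

The same helper could be called with `'genre'` or `'rating'` as `group_col`, which is the pattern the following sections repeat by hand.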

___
#### Top 5 selling genres by region
___

In [297]:
# find the top 5 sales genres by region

# NORTH AMERICA
# find top 5 genres
top_genre_na = new_games.groupby(
    'genre')['na_sales'].sum().nlargest(5).index

# filter for top 5 genres
df_top_genre_na = new_games[new_games['genre'].isin(top_genre_na)].copy()

# add column to mark region
df_top_genre_na['region'] = 'north_america'

# JAPAN
# find top 5 genres
top_genre_jp = new_games.groupby(
    'genre')['jp_sales'].sum().nlargest(5).index

# filter for top 5 genres
df_top_genre_jp = new_games[new_games['genre'].isin(top_genre_jp)].copy()

# add column to mark region
df_top_genre_jp['region'] = 'japan'

# EUROPE
top_genre_eu = new_games.groupby(
    'genre')['eu_sales'].sum().nlargest(5).index

# filter for top 5 genres
df_top_genre_eu = new_games[new_games['genre'].isin(top_genre_eu)].copy()

# add column to mark region
df_top_genre_eu['region'] = 'europe'

# other
top_genre_oth = new_games.groupby(
    'genre')['other_sales'].sum().nlargest(5).index

# filter for top 5 genres
df_top_genre_oth = new_games[new_games['genre'].isin(top_genre_oth)].copy()

# add column to mark region
df_top_genre_oth['region'] = 'other'

# merge them all together
region_genre = pd.concat([df_top_genre_oth, df_top_genre_eu,
                         df_top_genre_jp, df_top_genre_na], ignore_index=True)




In [298]:
# create individual bar charts
fig_eu = px.bar(df_top_genre_eu,
                x='eu_sales',
                y='region',
                orientation='h',
                color='genre')

fig_jp = px.bar(df_top_genre_jp,
                x='jp_sales',
                y='region',
                orientation='h',
                color='genre')

fig_na = px.bar(df_top_genre_na,
                x='na_sales',
                y='region',
                orientation='h',
                color='genre')

fig_oth = px.bar(df_top_genre_oth,
                 x='other_sales',
                 y='region',
                 orientation='h',
                 color='genre')


# merge charts together
additional_figures = [fig_jp, fig_na, fig_oth]
for fig in additional_figures:
    fig_eu.add_traces(fig.data)

fig_eu.update_traces(marker_line_width=0)

# add titles
fig_eu.update_layout(title="Top 5 selling genres by region",
                     xaxis_title="Total sales in millions (USD)",
                     yaxis_title="Region")
fig_eu.show()

North America and Europe:<br>
Action and shooter are the best-selling genres.

Europe also has racing in its top 5.

Japan has a different hierarchy: role-playing, action, misc, simulation, and fighting make up its top 5.
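
The per-region ranking behind these notes can also be produced in one pass. The sketch below is only an illustration, not the notebook's method: `top_n_by_region` and the toy frame are hypothetical, and the sales values are made up. It sums sales per genre for every region, reshapes to long format, and keeps the n largest totals in each region.

```python
import pandas as pd

# Toy stand-in for `new_games`; the values are invented for illustration only.
toy_games = pd.DataFrame({
    'genre': ['action', 'shooter', 'role-playing', 'action', 'racing', 'shooter'],
    'na_sales': [1.0, 2.0, 0.1, 0.5, 0.2, 0.4],
    'eu_sales': [0.8, 1.0, 0.1, 0.4, 0.6, 0.3],
    'jp_sales': [0.2, 0.0, 0.9, 0.3, 0.0, 0.1],
})


def top_n_by_region(df, group_col, sales_cols, n=5):
    """Hypothetical helper: n best-selling groups per region, in long format."""
    # Sum sales per group for all regional columns at once.
    totals = df.groupby(group_col)[sales_cols].sum()
    # Reshape to one row per (group, region) pair.
    long_df = totals.reset_index().melt(
        id_vars=group_col, var_name='region', value_name='sales')
    # Order each region by descending sales, then keep the first n rows of each.
    long_df = long_df.sort_values(['region', 'sales'], ascending=[True, False])
    return long_df.groupby('region').head(n).reset_index(drop=True)


top_genres = top_n_by_region(
    toy_games, 'genre', ['na_sales', 'eu_sales', 'jp_sales'], n=2)
print(top_genres)
```

Because the result is already aggregated and in long format, it could be passed to a single `px.bar(top_genres, x='sales', y='region', color='genre', orientation='h')` call instead of merging four figures.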
<br>

___
#### Top 5 selling ratings by region
___

In [299]:
# find the top 5 sales ratings by region

# NORTH AMERICA
# find top 5 rating
top_rating_na = new_games.groupby(
    'rating')['na_sales'].sum().nlargest(5).index

# filter for top 5 ratings (.copy() avoids SettingWithCopyWarning below)
df_top_rating_na = new_games[new_games['rating'].isin(top_rating_na)].copy()

# add column to mark region
df_top_rating_na['region'] = 'north_america'

# JAPAN
# find top 5 rating
top_rating_jp = new_games.groupby(
    'rating')['jp_sales'].sum().nlargest(5).index

# filter for top 5 ratings (.copy() avoids SettingWithCopyWarning below)
df_top_rating_jp = new_games[new_games['rating'].isin(top_rating_jp)].copy()

# add column to mark region
df_top_rating_jp['region'] = 'japan'

# EUROPE
top_rating_eu = new_games.groupby(
    'rating')['eu_sales'].sum().nlargest(5).index

# filter for top 5 ratings (.copy() avoids SettingWithCopyWarning below)
df_top_rating_eu = new_games[new_games['rating'].isin(top_rating_eu)].copy()

# add column to mark region
df_top_rating_eu['region'] = 'europe'

# other
top_rating_oth = new_games.groupby(
    'rating')['other_sales'].sum().nlargest(5).index

# filter for top 5 ratings (.copy() avoids SettingWithCopyWarning below)
df_top_rating_oth = new_games[new_games['rating'].isin(top_rating_oth)].copy()

# add column to mark region
df_top_rating_oth['region'] = 'other'

# merge them all together
region_rating = pd.concat([df_top_rating_oth, df_top_rating_eu,
                           df_top_rating_jp, df_top_rating_na], ignore_index=True)

In [300]:
# create individual bar charts
fig_eu = px.bar(df_top_rating_eu,
                x='eu_sales',
                y='region',
                orientation='h',
                color='rating')

fig_jp = px.bar(df_top_rating_jp,
                x='jp_sales',
                y='region',
                orientation='h',
                color='rating')

fig_na = px.bar(df_top_rating_na,
                x='na_sales',
                y='region',
                orientation='h',
                color='rating')

fig_oth = px.bar(df_top_rating_oth,
                 x='other_sales',
                 y='region',
                 orientation='h',
                 color='rating')


# merge charts together
additional_figures = [fig_jp, fig_na, fig_oth]
for fig in additional_figures:
    fig_eu.add_traces(fig.data)

fig_eu.update_traces(marker_line_width=0)

# add titles
fig_eu.update_layout(title="Top 5 selling game ratings by region",
                     xaxis_title="Total sales in millions (USD)",
                     yaxis_title="Region")
fig_eu.show()

Again, the European and North American markets have the same top 5. Japan purchased far fewer 'mature'-rated games.
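
One way to put a number on "far fewer" is to compare each rating's share of total sales within a region. This is a minimal sketch on made-up data: `rating_share` is a hypothetical helper, and the toy values do not come from the dataset.

```python
import pandas as pd

# Toy stand-in for `new_games`; the values are invented for illustration only.
toy_games = pd.DataFrame({
    'rating': ['mature', 'everyone', 'teen', 'mature', 'na', 'everyone'],
    'na_sales': [2.0, 1.0, 0.5, 1.0, 0.5, 0.5],
    'jp_sales': [0.1, 0.3, 0.2, 0.1, 0.8, 0.2],
})


def rating_share(df, sales_cols):
    """Hypothetical helper: each rating's fraction of total sales per region."""
    totals = df.groupby('rating')[sales_cols].sum()
    # Divide every column by its own total so each region sums to 1.
    return totals.div(totals.sum(axis=0), axis='columns')


shares = rating_share(toy_games, ['na_sales', 'jp_sales'])
print(shares.round(3))
```

Shares are comparable across regions even though the regions differ greatly in total sales volume.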
___
### Hypothesis testing

In [301]:
# H0 (null): average user ratings for Xbox One and PC games are the same

# H1 (alternative): average user ratings for Xbox One and PC games are different

# create filtered df for platform of choice
pc = new_games[new_games['platform'] == 'computer']
xboxone = new_games[new_games['platform'] == 'xbox_one']

# drop rows with a missing value in any column (not only user_score)
pc = pc.dropna()
xboxone = xboxone.dropna()

# averages for platforms
user_avg_pc = pc['user_score'].mean()
user_avg_one = xboxone['user_score'].mean()

# comparing means
print("Comparing sample means from pc games and xbox one of user_scores")
t_test(pc['user_score'], xboxone['user_score'])
print(f"user score mean of pc sales: {user_avg_pc:.2f}")
print(f"user score mean of xbox one sales: {user_avg_one:.2f}")

Comparing sample means from pc games and xbox one of user_scores
We cannot reject the Null hypothesis based on this information. pvalue: 0.5926561176517504
user score mean of pc sales: 64.54
user score mean of xbox one sales: 65.38


thoughts
___
I am not surprised that there was no significant difference between user scores on different platforms. After viewing the trendlines for ratings, user ratings appeared to have little relationship to sales.
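
The `t_test` helper used here is defined earlier in the notebook and is not repeated in this section. For readers without that cell, a comparable helper might look like the sketch below. It is an assumption, not the notebook's actual function: the name `t_test_sketch`, the 0.05 threshold, and the choice of Welch's test (`equal_var=False`) are all illustrative.

```python
from scipy import stats as st


def t_test_sketch(sample_a, sample_b, alpha=0.05):
    """Hypothetical stand-in for the notebook's t_test helper."""
    # Welch's t-test: compares two means without assuming equal variances.
    result = st.ttest_ind(sample_a, sample_b, equal_var=False)
    reject = bool(result.pvalue < alpha)
    if reject:
        print(f"Reject the null hypothesis, p-value: {result.pvalue}")
    else:
        print(f"Cannot reject the null hypothesis, p-value: {result.pvalue}")
    return result.pvalue, reject


# Made-up scores: the first pair has nearly equal means, the second does not.
p_close, reject_close = t_test_sketch([60, 65, 70, 62, 68], [61, 66, 69, 63, 67])
p_far, reject_far = t_test_sketch([10, 11, 12, 13, 14], [50, 51, 52, 53, 54])
```

Failing to reject the null does not show that the two means are equal, only that this sample gives no strong evidence of a difference.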

In [308]:
# H0 (null): average user ratings for the action and sports genres are the same

# H1 (alternative): average user ratings for the action and sports genres are different

# create filtered df for genre of choice
action = new_games[new_games['genre'] == 'action']
sports = new_games[new_games['genre'] == 'sports']

# drop rows with a missing value in any column (not only user_score)
action = action.dropna()
sports = sports.dropna()

# averages for genres
user_avg_action = action['user_score'].mean()
user_avg_sports = sports['user_score'].mean()

# comparing means
print("Comparing sample means of user_scores for action and sports games")
t_test(action['user_score'], sports['user_score'])
print(f"user score mean of action genre sales: {user_avg_action:.2f}")
print(f"user score mean of sports genre sales: {user_avg_sports:.2f}")

Comparing sample means of user_scores for action and sports games
Reject Null hypothesis, p-value: 1.2971812049381573e-15 The sample means are sufficiently different
user score mean of action genre sales: 68.99
user score mean of sports genre sales: 58.22


The difference between the mean user scores of the action and sports genres is statistically significant (p ≈ 1.3e-15), so it is unlikely to be due to chance. In this sample, action games are rated higher on average than sports games (68.99 vs. 58.22).

Japan differs from North America and Europe in genre and rating choices. Marketing strategies for North America and Europe can therefore be more similar to each other than to Japan's.
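
One caveat on the test inputs: calling `dropna()` with no arguments removes a row if any column is missing, so games with a valid `user_score` but a gap elsewhere are excluded from the comparison. The toy example below shows the difference from `dropna(subset=['user_score'])`. The frame is made up, and `critic_score` is only assumed to be one of the other columns.

```python
import numpy as np
import pandas as pd

# Toy data: the first action game has a user score but no critic score.
toy = pd.DataFrame({
    'genre': ['action', 'action', 'sports', 'sports'],
    'user_score': [70.0, 80.0, 55.0, np.nan],
    'critic_score': [np.nan, 75.0, 60.0, 58.0],
})

action = toy[toy['genre'] == 'action']

# Drops every row with a missing value in any column.
all_cols = action.dropna()
# Drops only rows where the tested column itself is missing.
score_only = action.dropna(subset=['user_score'])

print(len(all_cols), all_cols['user_score'].mean())
print(len(score_only), score_only['user_score'].mean())
```

Restricting the drop to the tested column keeps more observations, which can change both the sample means and the p-value.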

In [309]:
# create region specific filters

choice_na = new_games[(new_games['rating'] == 'mature') &
                      (new_games['platform'].isin(['xbox_one', 'playstation_4'])) &
                      (new_games['genre'].isin(['shooter', 'action'])) &
                      (new_games['mobile'] == 0)]

choice_eu = new_games[(new_games['rating'] == 'mature') &
                      (new_games['platform'].isin(['playstation_4'])) &
                      (new_games['genre'].isin(['shooter', 'action'])) &
                      (new_games['mobile'] == 0)]

choice_jp = new_games[(new_games['rating'] == 'na') &
                      (new_games['platform'].isin(['playstation_4', '3ds'])) &
                      (new_games['genre'].isin(['role-playing', 'action']))]

# each replacement field is kept on one line so the f-strings also run on
# Python versions older than 3.12
avg_na = new_games['na_sales'].mean()
avg_eu = new_games['eu_sales'].mean()
avg_jp = new_games['jp_sales'].mean()

print(f"Average of all games: {new_games['total_sales'].mean():.2f} million")
print()
print(f"Average of North America game sales: {avg_na:.2f} million")
print(f"Average of North America choices: {choice_na['na_sales'].mean():.2f} million")
print()
print(f"Average of Europe game sales: {avg_eu:.2f} million")
print(f"Average of Europe choices: {choice_eu['eu_sales'].mean():.2f} million")
print()
print(f"Average of Japan game sales: {avg_jp:.2f} million")
print(f"Average of Japan choices: {choice_jp['jp_sales'].mean():.2f} million")

Average of all games: 0.50 million

Average of North America game sales: 0.20 million
Average of North America choices: 0.64 million

Average of Europe game sales: 0.18 million
Average of Europe choices: 0.83 million

Average of Japan game sales: 0.07 million
Average of Japan choices: 0.21 million
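
The averages above are easier to judge alongside the number of games in each filtered subset, since a mean over a handful of titles can be driven by one hit. A minimal sketch, assuming a hypothetical helper `summarize_choice` and made-up data:

```python
import pandas as pd

# Toy stand-in for `new_games`; the values are invented for illustration only.
toy_games = pd.DataFrame({
    'rating': ['mature', 'everyone', 'mature', 'teen'],
    'genre': ['shooter', 'sports', 'action', 'action'],
    'na_sales': [1.0, 0.2, 0.3, 0.1],
})


def summarize_choice(df, mask, sales_col):
    """Hypothetical helper: compare a filtered subset with the whole region."""
    subset = df[mask]
    overall_mean = df[sales_col].mean()
    choice_mean = subset[sales_col].mean()
    return {
        'n_games': len(subset),            # how many titles back the average
        'overall_mean': overall_mean,
        'choice_mean': choice_mean,
        'lift': choice_mean / overall_mean,  # > 1 means the subset sells better
    }


mask_na = (toy_games['rating'] == 'mature') & toy_games['genre'].isin(['shooter', 'action'])
summary_na = summarize_choice(toy_games, mask_na, 'na_sales')
print(summary_na)
```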


#### Conclusion:
___
Regional sales differ enough to justify region-specific ad campaigns.

North America sales favor:<br>
mature ratings<br>
action or shooter genre<br>
modern non-mobile platforms<br>

Europe sales favor:<br>
mature ratings<br>
action and shooter genre, with racing making an appearance in top 5<br>
non-mobile platforms, with a preference for PlayStation<br>

Japan sales favor:<br>
games with an 'na' rating sell the most; mature is still in the top 5<br>
action and role-playing genres are preferred<br>
a preference for Japanese platforms, and a larger mobile game market<br>

These are historical tendencies for the regions, and new platforms can change them. Wii (Nintendo) sales were strong in Europe, while other Nintendo platforms lagged behind. Platform lifespan is about 7 years.

Further exploration ideas:

If a game has zero sales in a market, was that game ever introduced in that market? This is an internet video game company with global clients, so I am guessing any game could be purchased anywhere. Does Xbox release games in other languages? Xbox sales lag in both the European and Japanese markets.
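
A first step toward the zero-sales question could be to measure how common zero values are in each region. The sketch below uses made-up data and assumes the regional column names used elsewhere in this notebook. A recorded zero may mean the game was never sold there, or possibly that its sales were too small to register, and this check cannot tell those cases apart.

```python
import pandas as pd

# Toy stand-in for `new_games`; the values are invented for illustration only.
toy_games = pd.DataFrame({
    'name': ['game_a', 'game_b', 'game_c', 'game_d'],
    'na_sales': [0.5, 0.0, 0.3, 0.0],
    'eu_sales': [0.2, 0.1, 0.0, 0.0],
    'jp_sales': [0.0, 0.0, 0.0, 0.4],
})

sales_cols = ['na_sales', 'eu_sales', 'jp_sales']

# Fraction of games with exactly zero recorded sales in each region.
zero_share = (toy_games[sales_cols] == 0).mean()
print(zero_share)

# Games recorded in only one region are candidates for a regional-only release.
single_region = toy_games[(toy_games[sales_cols] > 0).sum(axis=1) == 1]
print(single_region['name'].tolist())
```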
