Description of the project
You work in the online store "Stremchik", which sells computer games all over the world. Historical data on game 
sales, user and expert ratings, genres and platforms (for example, Xbox or PlayStation) are available from open 
sources. You need to identify the patterns that determine the success of the game. This will allow you to bet on a 
potentially popular product and plan your advertising campaigns.
You have data up to 2016. Suppose it is December 2016 and you are planning a campaign for 2017. The goal is
to work out a general approach to the data: it does not matter whether you forecast 2017 sales from 2016 data or
2027 sales from 2026 data.
Games.csv data description
Name - the name of the game
Platform - platform
Year_of_Release - release year
Genre - game genre
NA_sales - North American sales (millions of dollars)
EU_sales - Sales in Europe (millions of dollars)
JP_sales - Sales in Japan (millions of dollars)
Other_sales - sales in other countries (millions of dollars)
Critic_Score - score of critics (from 0 to 100)
User_Score - user score (from 0 to 10)
Rating - rating from the ESRB (Entertainment Software Rating Board). This organization
determines the rating of computer games and assigns them a suitable age category.
Data for 2016 may be incomplete.

In [None]:
#Step 2. Open the data file and study the general information
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore') 
from datetime import datetime
from scipy import stats as st
games = pd.read_csv('/datasets/games.csv') 
games.head()
games.describe()
games.info()

In [None]:
#Step 3. Prepare the data
#Convert the column names to lowercase
games.columns = games.columns.str.lower()

In [None]:
#Lowercase the text columns as well, to avoid mismatches caused by letter case
games['name'] = games['name'].str.lower()
games['platform'] = games['platform'].str.lower()
games['genre'] = games['genre'].str.lower()
games['rating'] = games['rating'].str.lower()

In [None]:
#Duplicate check - 0
games.duplicated().sum()

In [None]:
#convert to float format, replacing incorrect values with NaN
games['user_score'] = pd.to_numeric(games['user_score'], errors='coerce')

In [None]:
#Fill in the gaps in the column with the year of publication with zeros and
# convert everything to int
games['year_of_release'] = games['year_of_release'].fillna(0)
games['year_of_release'] = games['year_of_release'].astype('int')

In [None]:
#In the column with the names of the games, fill the 2 missing values
games['name'] = games['name'].fillna('unknown_name')

In [None]:
#In the column with game genres, fill the 2 missing values
games['genre'] = games['genre'].fillna('unknown_genre')

In [None]:
#Fill the missing age ratings
games['rating'] = games['rating'].fillna('unknown_rating')
games.isnull().sum().sort_values(ascending = False)

Let's check which non-numeric values occur, in case a user score was written out in words.
Gaps are mostly found in the 'critic_score', 'user_score' and 'rating' columns.
Most likely these columns were joined in by game id from another database.
The data may be unavailable for specific platforms (N64, SNES, SAT, 2600, GB, NES, GEN, NG, etc.) or years (games that are too old).
tbd ("to be determined") means that the score had not been set when the data was collected and, apparently, was never published.
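The check for non-numeric scores described above could look roughly like this. It is a minimal sketch on made-up values (the `raw_scores` series is an assumption, not the real column), showing which entries fail to parse and what `errors='coerce'` does with them.

```python
import pandas as pd

# Toy stand-in for the raw user_score column; the values are invented for illustration.
raw_scores = pd.Series(['8.5', 'tbd', None, '7'], name='user_score')

# Coerce to numbers: anything that cannot be parsed becomes NaN.
as_numbers = pd.to_numeric(raw_scores, errors='coerce')

# Entries that were present in the raw data but are not valid numbers.
non_numeric = raw_scores[as_numbers.isna() & raw_scores.notna()]
print(non_numeric.unique())

# Number of gaps after coercion: genuinely missing values plus the coerced markers.
print(as_numbers.isna().sum())
```

Running a check like this before the conversion shows which markers will be silently turned into gaps.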

In [None]:
#Count the total sales in all regions and write them down in a separate column
games['total_sales'] = games['na_sales'] + games['eu_sales'] + games['jp_sales'] + games['other_sales'] 
games.head()

In [None]:
#Step 4. Conduct exploratory data analysis
#Release of games by years
games[games['year_of_release'] != 0]['year_of_release'].hist(bins=40)

The main peak of game releases falls in 2008-2010.
Games started to be published in the early 1980s,
but it took 15-20 years of technological development before games were released in large numbers. In
my opinion, data before 2000 can be excluded, because the number of games released in those years is
insignificant compared with later years.
The decline after the peak is most likely due to the spread of mobile devices (smartphones, tablets, etc.): it
became easier for users to play during breaks, in queues, on the way to work, etc.
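The peak read off the histogram can also be located numerically. The following is a small sketch on invented years (the `toy_years` series is hypothetical), using 0 as the placeholder for an unknown release year, as in the notebook.

```python
import pandas as pd

# Invented release years; 0 marks an unknown year, as in the notebook.
toy_years = pd.Series([2007, 2008, 2008, 2009, 2009, 2009, 2010, 2010, 0],
                      name='year_of_release')

# Drop the placeholder and count releases per year in chronological order.
known = toy_years[toy_years != 0]
releases_per_year = known.value_counts().sort_index()

# Year with the largest number of releases.
peak_year = releases_per_year.idxmax()
print(releases_per_year)
print('Peak year:', peak_year)
```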

In [None]:
#See how sales have changed by platform
games.pivot_table(index='year_of_release', columns='platform', values='total_sales', aggfunc = 'sum')
games[games['year_of_release'] != 0].pivot_table(index='year_of_release', columns='platform',
values='total_sales', aggfunc='sum').plot(figsize=(10,5))
plt.title('Dynamics of sales of games across all platforms')
plt.show()

In [None]:
#Select the platforms with the highest total sales and plot the distribution by year
platform_sales = games.pivot_table(index='platform', values='total_sales', aggfunc='sum').sort_values(by='total_sales',ascending=False)
#print(platform_sales)
platform_sales_top = platform_sales.query('total_sales > 259')
#print(platform_sales_top)

platform_list = ['ps2', 'x360', 'ps3', 'wii', 'ds', 'ps', 'ps4', 'psp', 'pc']
games_top_platform = games.query('platform in @platform_list')
games_top_platform[games_top_platform['year_of_release'] != 0].pivot_table(index='year_of_release', columns='platform', values='total_sales', aggfunc = 'sum').plot(figsize=(10, 5))
plt.title('Dynamics of game sales by top platforms')
plt.show()

In [None]:
#Find platforms that were popular in the past but whose sales have now dropped to zero
#ps, ps2, ds, wii, psp


Over what characteristic period do new platforms appear and old ones disappear?
The Xbox and PS examples suggest that a platform stays relevant for 8-10 years on average.

Let's build a pivot table by platform with the earliest and latest release years of the games on each platform.
Then we exclude the boundary values, which removes all the "edge" platforms: those whose first release is dated 1980 or earlier, so their start may not be captured, and those whose cycle had not yet ended by 2016.
Incidentally, PC drops out immediately, since games are still being released for it.

In [None]:
plat_year_pivot = games[games['year_of_release'] != 0].pivot_table(index='platform', values='year_of_release', aggfunc = ('max', 'min'))

#Keep platforms with a complete life cycle; .copy() lets us add a column without touching the original pivot
plat_year_pivot_act = plat_year_pivot.query('(min > 1980) & (max < 2016) & (min != max)').copy()
plat_year_pivot_act['platform_lifetime'] = plat_year_pivot_act['max'] - plat_year_pivot_act['min']
print(plat_year_pivot_act)
print(plat_year_pivot_act['platform_lifetime'].mean())

In [None]:
#One console stands out as especially long-lived: Nintendo DS (28 years)
#The DS was released in 2004, so this span most likely comes from a mis-dated record in the data
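A single mis-dated record is enough to stretch a platform's apparent lifetime. One possible way to make the estimate less sensitive to such records is to count only the years in which a platform has at least some minimum number of releases. The sketch below uses invented records, and the `min_releases` threshold is a hypothetical parameter, not something taken from the notebook.

```python
import pandas as pd

# Invented release records; 'ds' has one suspiciously early year.
toy = pd.DataFrame({
    'platform': ['ds', 'ds', 'ds', 'ds', 'ds', 'ps2', 'ps2', 'ps2', 'ps2'],
    'year_of_release': [1985, 2004, 2004, 2005, 2005, 2000, 2000, 2001, 2001],
})

min_releases = 2  # hypothetical threshold: ignore years with fewer releases

# Count releases per platform and year, then keep only well-populated years.
per_year = (toy.groupby(['platform', 'year_of_release'])
               .size()
               .reset_index(name='n_games'))
active = per_year[per_year['n_games'] >= min_releases]

# Lifetime = last active year minus first active year.
lifetime = active.groupby('platform')['year_of_release'].agg(['min', 'max'])
lifetime['platform_lifetime'] = lifetime['max'] - lifetime['min']
print(lifetime)
```

With the threshold applied, the lone 1985 record no longer defines the start of the platform's cycle.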

In [None]:
#Decide which period of data to use so that the platform distribution is not significantly distorted; 2016 is left out because its data may be incomplete
#If 1995-2015 is the whole relevant period, then for the forecast we analyze only the decline phase of the gaming market:
# starting from the peak in 2009, we take the period 2009-2015

In [None]:
years_list = list(range(2009, 2016))  # 2009 through 2015 inclusive
#print(years_list)

games_data = games.query('year_of_release in @years_list')
#print(games_data.head())

In [None]:
#Next, only work with the data that you have defined. Do not include data from previous years
#Which platforms are leading in terms of sales, rising or falling? Choose a few potentially profitable platforms
games_data.pivot_table(index = 'platform', values = 'total_sales', aggfunc = 'sum').sort_values(by='total_sales', ascending=False)

The top sellers for the period under review are PS3, Xbox 360 and Wii.
However, all of these platforms are already nearing the end of their popularity cycle.
PS4 and Xbox One can be considered potentially profitable: they replace the once popular PS3 and Xbox 360, respectively.
Sales on PC are not as high as on game consoles, but PC games stay relevant over time.

In [None]:
#Draw a box-and-whisker plot of global sales per game, broken down by platform
games_data.describe()

pivot_game_sale = games_data.pivot_table(index='name', columns='platform', values='total_sales', aggfunc = 'sum')
#print(pivot_game_sale)

plot = pivot_game_sale.boxplot(figsize=(10,10)).set_ylim(0,5)
plt.title('Box plot of game sales by platform')
plt.show()
#plot = pivot_game_sale.boxplot(figsize=(10,10))
#plt.show()

Is there a big difference in sales? What about average sales across different platforms? Describe the result
There are a few "hit" games whose sales exceed 20, 25 and even 30 million dollars; they differ sharply from the average and lie far beyond the whiskers.
In general, the picture is similar across platforms: the lower whisker rests at 0, and the upper one is around $1 million for most platforms. Some platforms are more successful: PS3, Xbox 360 and Wii U have an upper whisker above $1 million, and for PS4 and Xbox One, the next generations of these platforms, the upper whisker reaches about $2.5 million.

It is clearly seen that sales per game grow as a platform family evolves. For example: PS4 > PS3 > PS2,
Xbox One > Xbox 360 and Wii U > Wii.

Although we took the data for 2009-2015, the difference in sales can still be affected by rising game prices over time, at least due to inflation, as well as by rising game production costs.
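The whisker positions discussed above can also be checked numerically instead of being read off the chart. The sketch below uses invented sales figures and assumes the default matplotlib convention, where the upper whisker ends at the largest observation within 1.5 interquartile ranges above the third quartile; `upper_whisker` is a helper written for this illustration.

```python
import pandas as pd

# Invented per-game sales for two platforms, each with one "hit" game.
toy = pd.DataFrame({
    'platform': ['ps4'] * 5 + ['ps3'] * 5,
    'total_sales': [0.1, 0.4, 0.8, 1.2, 12.0, 0.1, 0.2, 0.3, 0.6, 5.0],
})

def upper_whisker(sales):
    """Largest value not exceeding Q3 + 1.5 * IQR."""
    q1, q3 = sales.quantile([0.25, 0.75])
    limit = q3 + 1.5 * (q3 - q1)
    return sales[sales <= limit].max()

# Median, mean and upper whisker per platform.
summary = toy.groupby('platform')['total_sales'].agg(['median', 'mean', upper_whisker])
print(summary)
```

Comparing the median with the mean in such a table also shows how strongly a few hit games pull the average upward.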

In [None]:
#See how user and critic reviews affect sales within one popular platform
games_data_ps2 = games_data.query('platform == "ps2"')
#print(games_data_ps2.head())

games_data_ps2.plot(x='user_score', y='total_sales', kind='scatter', color='blue', alpha = 1, figsize=(10,5)).set_ylim(0,3)
plt.title('Impact of user ratings on PS2 game sales')
plt.show()
games_data_ps2.plot(x='critic_score', y='total_sales', kind='scatter', color='maroon', alpha = 1, figsize=(10,5)).set_ylim(0,3)
plt.title('Impact of critic ratings on PS2 game sales')
plt.show()

In [None]:
#Build a scatterplot and calculate the correlation between reviews and sales
corr_data = pd.DataFrame()
corr_data['total_sales'] = games_data['total_sales']
corr_data['user_score'] = games_data['user_score']
corr_data['critic_score'] = games_data['critic_score']
#print(corr_data.head())

corr_data.plot(x='user_score', y='total_sales', kind='scatter', color='blue', legend=True, alpha = 0.15, figsize=(10,5)).set_ylim(0,20)
plt.title('Impact of user ratings on game sales across all platforms')
plt.show()
corr_data.plot(x='critic_score', y='total_sales', kind='scatter', color='maroon', legend=True, alpha = 0.15, figsize=(10,5)).set_ylim(0,20)
plt.title('Impact of critic ratings on game sales across all platforms')
plt.show()

corr_data[['total_sales', 'user_score', 'critic_score']].corr().style.format("{:.2%}")

Formulate conclusions and correlate them with sales of games on other platforms

The scatter plots for PS2 and for all platforms combined look very similar.
As a rule, the higher a game's critic and user ratings, the higher its sales (not without outliers, of course).

The dependence of sales on user ratings is the weakest: the correlation is below 9%.
The dependence of sales on critic ratings is stronger, about 25%, but it is still considered weak.
Critic ratings and user ratings are correlated with each other at almost 59%.
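To relate these figures to individual platforms, the same correlations can be computed per platform. This is a minimal sketch on invented numbers (the `toy` frame is an assumption); it uses `DataFrame.corrwith`, which computes the Pearson correlation by default.

```python
import pandas as pd

# Invented scores and sales for two platforms.
toy = pd.DataFrame({
    'platform': ['ps3'] * 4 + ['x360'] * 4,
    'critic_score': [60, 70, 80, 90, 55, 65, 75, 85],
    'user_score': [6.0, 7.5, 7.0, 8.5, 5.0, 7.0, 6.5, 8.0],
    'total_sales': [0.2, 0.5, 0.9, 2.0, 0.1, 0.4, 0.6, 1.5],
})

# Correlation of each score column with sales, computed separately per platform.
correlations = {}
for platform, group in toy.groupby('platform'):
    scores = group[['critic_score', 'user_score']]
    correlations[platform] = scores.corrwith(group['total_sales'])
    print(platform, correlations[platform].round(2).to_dict())
```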

In [None]:
#Look at the general distribution of games by genre. What about the most profitable genres?
games_data.pivot_table(index='genre', values='name', aggfunc = 'count').sort_values(by='name', ascending=False)
#Most action, misc and sports games
games_data.pivot_table(index='genre', values='total_sales', aggfunc = 'sum').sort_values(by='total_sales', ascending=False)

It is quite expected that sales in these genres are high.
However, the shooter genre also has high sales even though the number of games in it is not that large. This indicates the popularity of shooters: more copies are sold per game.
Do high- and low-selling genres stand out?
The weakest sellers are puzzle and strategy games.

In [None]:
#Step 5. Make a portrait of the user of each region
#Define for each user region (NA, EU, JP):
#Most popular platforms (top 5). Describe the differences in sales shares
games_data.groupby(by='platform').agg({'na_sales':'sum'}).sort_values(by='na_sales', ascending=False).head(5).plot(kind='bar', color='blue', legend=True)
plt.title('Top 5 Popular Gaming Platforms in North America')
plt.show()
games_data.groupby(by='platform').agg({'eu_sales':'sum'}).sort_values(by='eu_sales', ascending=False).head(5).plot(kind='bar', color='green', legend=True)
plt.title('Top 5 Popular Gaming Platforms in Europe')
plt.show()
games_data.groupby(by='platform').agg({'jp_sales':'sum'}).sort_values(by='jp_sales', ascending=False).head(5).plot(kind='bar', color='maroon', legend=True)
plt.title('Top 5 Popular Gaming Platforms in Japan')
plt.show()

North America has the largest share of sales; in the observed period players preferred the Xbox 360 and PS3.
Sales in the European market are almost 2 times lower than in North America; the same platforms lead, only in reverse order (PS3, then Xbox 360).
In Japan the market is smaller still, and the most popular console is the Nintendo 3DS, released in 2011, most likely because Japanese players prefer portable consoles for playing on the go.
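The bar charts show absolute sales, so the differences in sales shares are easier to compare once each region is normalized to 1. A small sketch on invented numbers (the `toy` frame is hypothetical):

```python
import pandas as pd

# Invented regional sales per game.
toy = pd.DataFrame({
    'platform': ['x360', 'ps3', '3ds', 'x360', 'ps3', '3ds'],
    'na_sales': [3.0, 2.0, 1.0, 3.0, 2.0, 1.0],
    'eu_sales': [1.5, 2.0, 0.5, 1.5, 2.0, 0.5],
    'jp_sales': [0.1, 0.4, 1.0, 0.1, 0.4, 1.0],
})

region_cols = ['na_sales', 'eu_sales', 'jp_sales']

# Total sales per platform in each region.
by_platform = toy.groupby('platform')[region_cols].sum()

# Divide each column by its total so that every region sums to 1.
shares = by_platform / by_platform.sum()
print(shares.round(2))
```

Each column then reads as the platform's share of that region's market, which makes regions of very different size directly comparable.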

In [None]:
#Most popular genres (top 5). Explain the difference
games_data.groupby(by='genre').agg({'na_sales':'sum'}).sort_values(by='na_sales', ascending=False).head(5).plot(kind='bar', color='blue', legend=True)
plt.title('Top 5 Popular Game Genres in North America')
plt.show()
games_data.groupby(by='genre').agg({'eu_sales':'sum'}).sort_values(by='eu_sales', ascending=False).head(5).plot(kind='bar', color='green', legend=True)
plt.title('Top 5 Popular Game Genres in Europe')
plt.show()
games_data.groupby(by='genre').agg({'jp_sales':'sum'}).sort_values(by='jp_sales', ascending=False).head(5).plot(kind='bar', color='maroon', legend=True)
plt.title('Top 5 Popular Game Genres in Japan')
plt.show()

Action and shooter games are preferred in North America and Europe,
whereas in Japan RPGs are the most popular, so the market there does not look like the rest of the world.
First, the Japanese gaming market is one of the oldest, so the age of players can often reach 40-50 years.
Second, as the conclusions above suggest, these same users have been playing on Nintendo since the late 80s and continue to play on new-generation Nintendo consoles.
Third, Japanese culture differs markedly from American and European culture,
with its own traditions and characteristics: take the popularity of anime and manga, slot machines and other distinctly Japanese pastimes.
This may explain why shooters attract much less interest there than in the other markets considered.

In [None]:
#Does the ESRB rating affect sales in a particular region?
games_data.groupby(by='rating').agg({'na_sales':'sum'}).sort_values(by='na_sales', ascending=False).plot(kind='bar', color='blue', legend=True)
plt.title('Total Game Sales by Age Rating in North America')
plt.show()
games_data.groupby(by='rating').agg({'eu_sales':'sum'}).sort_values(by='eu_sales', ascending=False).plot(kind='bar', color='green', legend=True)
plt.title('Total Game Sales by Age Rating in Europe')
plt.show()
games_data.groupby(by='rating').agg({'jp_sales':'sum'}).sort_values(by='jp_sales', ascending=False).plot(kind='bar', color='maroon', legend=True)
plt.title('Total Game Sales by Age Rating in Japan')
plt.show()

In North America and Europe the situation is about the same: the best-selling games are in category E (for everyone), followed by games with no rating specified.
In Japan, by contrast, games with no rating specified are in first place.
It is possible that the most popular games on the Japanese market are produced in Japan itself and are not particularly popular in the rest of the world, so they never receive a rating from the ESRB, which rates games for the North American market.

In [None]:
#Step 6. Conduct a study of statistical indicators
#How do user ratings and critics ratings change across genres?
user_genre_pivot = games_data.pivot_table(index='genre',values='user_score', aggfunc = 'mean')
print('Average user ratings by game genre')
print(user_genre_pivot.sort_values(by='user_score', ascending=False))
#The average user rating across genres lies in the range 6.2-7.3 (on a 0-10 scale)

In [None]:
critic_genre_pivot = games_data.pivot_table(index='genre',values='critic_score', aggfunc = 'mean')
print('Average Critic Ratings by Game Genre')
print(critic_genre_pivot.sort_values(by='critic_score', ascending=False))
#The average critic rating across genres lies in the range 65-73 (on a 0-100 scale), which indicates that the two sets of ratings are similar

In [None]:
games_data.groupby(by='genre').agg({'user_score':'mean'}).sort_values(by='user_score', ascending=False).plot(kind='bar', color='blue', legend=True)
plt.title('User ratings by game genre')
plt.show()

games_data.groupby(by='genre').agg({'critic_score':'mean'}).sort_values(by='critic_score', ascending=False).plot(kind='bar', color='maroon', legend=True)
plt.title('Critics ratings by game genre')
plt.show()
#Users are more willing to give high scores to the RPG, Platform, Puzzle genres
#Critics are more willing to give high scores for almost the same genres - Platform, RPG, Strategy

In [None]:
#Calculate the mean, variance and standard deviation
games_data_genre_score = games_data[['genre', 'critic_score', 'user_score']]
score_columns = ['critic_score', 'user_score']

genre_list = ['action', 'adventure', 'fighting', 'misc', 'platform', 'puzzle', 'racing', 'role-playing', 'shooter', 'simulation', 'sports', 'strategy']
for genre in genre_list:
    #Select only the numeric score columns for the current genre
    genre_scores = games_data_genre_score.loc[games_data_genre_score['genre'] == genre, score_columns]
    #Sample variance (ddof=1) of each score column within the genre
    variance_estimate = genre_scores.var(ddof=1)

    #Print inside the loop so that every genre is reported, not only the last one
    print('Variance by genre:', genre)
    print(variance_estimate)
    print('Standard deviation by genre:', genre)
    print(np.sqrt(variance_estimate))
    print()

Overall, the standard deviation of critic scores across genres is around 10-15 points.
It is lowest for the RPG and Puzzle genres
and highest for the Adventure and Sports genres.

The standard deviation of user scores is also about the same across genres, around 1.5.
The RPG genre stands out with the lowest value, that is, the smallest spread of the distribution.
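The per-genre figures above can also be collected into a single table, which makes the lowest and highest spreads easier to spot. The following sketch uses invented scores (the `toy` frame is an assumption); note that pandas `var` and `std` use the sample estimate (`ddof=1`) by default.

```python
import pandas as pd

# Invented scores for two genres.
toy = pd.DataFrame({
    'genre': ['action', 'action', 'action', 'puzzle', 'puzzle', 'puzzle'],
    'user_score': [6.0, 7.0, 8.0, 7.0, 7.5, 8.0],
    'critic_score': [60.0, 70.0, 80.0, 72.0, 75.0, 78.0],
})

# Mean, sample variance and sample standard deviation per genre, in one table.
stats = toy.groupby('genre')[['user_score', 'critic_score']].agg(['mean', 'var', 'std'])
print(stats)

# Genre with the smallest spread of user scores.
print(stats[('user_score', 'std')].idxmin())
```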

In [None]:
#Plot histograms. Describe the distribution
for genre in genre_list:
    print('Density of distribution of user ratings by genre:', genre)
    games_data[games_data['genre'] == genre]['user_score'].hist(bins=10)
    plt.show()
    print('----------------------------------------------------------------')
    print('Density of distribution of critics ratings by genre:', genre)
    games_data[games_data['genre'] == genre]['critic_score'].hist(bins=100, color='maroon')
    plt.show()
    print('----------------------------------------------------------------')

#The histograms for both users and critics look similar across all genres: they are
# negatively skewed, with most ratings above the midpoint of the scale (5 out of 10).

Step 7. Test hypotheses
Hypothesis: the average user ratings for the Xbox One and PC platforms are the same.
H0: Average user ratings for the Xbox One and PC platforms are the same
H1: Average user ratings for the Xbox One and PC platforms differ

Set your own threshold value alpha:
0.05 is the standard value for this kind of research (not too strict)

In [None]:
alpha = 0.05

xbox_one = games_data[(games_data['platform'] == 'xone') & (games_data['user_score'] > 0)]['user_score']
pc = games_data[(games_data['platform'] == 'pc') & (games_data['user_score'] > 0)]['user_score']

results = st.ttest_ind(xbox_one, pc)
print('p-value:', results.pvalue)
if (results.pvalue < alpha):
    print("Rejecting the null hypothesis")
else:
    print("Failed to reject the null hypothesis")

This means that we could not reject the hypothesis that the average user ratings for the Xbox One and PC platforms are the same.
This does not prove that the ratings are equal; our sample simply does not show a statistically significant difference.
It often happens that a game turns out well on one console, without serious bugs, while on another platform major bugs may go unfixed for years.
But, apparently, the quality of games on PC and Xbox One is about the same.

In [None]:
#The average user ratings of the Action and Sports genres are different:
#H0: Average user ratings for Action and Sports genres are the same
#H1: Average user ratings for Action and Sports are different
alpha = 0.05

action = games_data[(games_data['genre'] == 'action') & (games_data['user_score'] > 0)]['user_score']
sports = games_data[(games_data['genre'] == 'sports') & (games_data['user_score'] > 0)]['user_score']

results = st.ttest_ind(action, sports)
print('p-value:', results.pvalue)
if (results.pvalue < alpha):
    print("Rejecting the null hypothesis")
else:
    print("Failed to reject the null hypothesis")
#We reject the hypothesis that the average user ratings for the Action and
# Sports genres are the same. In our sample, it turns out that users, on average,
# give different ratings to games in the Action and Sports genres.

How did you formulate the null and alternative hypotheses:
H0 always states equality, or the absence of any change
H1 is the alternative (the opposite)

What criterion was used to test hypotheses and why:
Student's t-test for two independent samples, because we compare the means of two groups using samples rather than the whole population
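By default `st.ttest_ind` assumes that both groups have equal variances. If that assumption is in doubt, one option (not the one used in this notebook) is to check it with Levene's test and fall back to Welch's variant via `equal_var=False`. The sketch below runs on synthetic samples; all numbers in it are invented.

```python
import numpy as np
from scipy import stats as st

# Synthetic "user scores" for two groups with different spreads (invented parameters).
rng = np.random.default_rng(0)
sample_a = rng.normal(loc=7.0, scale=1.0, size=200)
sample_b = rng.normal(loc=6.5, scale=2.0, size=150)

alpha = 0.05

# Levene's test: H0 is that the two groups have equal variances.
levene = st.levene(sample_a, sample_b)
equal_var = bool(levene.pvalue >= alpha)

# Student's t-test if the variances look equal, Welch's t-test otherwise.
results = st.ttest_ind(sample_a, sample_b, equal_var=equal_var)
print('Levene p-value:', levene.pvalue)
print('equal_var:', equal_var)
print('t-test p-value:', results.pvalue)
```

Passing `equal_var=False` unconditionally is a simpler and commonly used alternative when the group sizes or spreads differ noticeably.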

Conclusions:
Games started to be published in the early 1980s,
but it took 15-20 years of technological development before games were released in large numbers. The main peak of game releases falls in 2008-2010.
It was followed by a decline in the industry, most likely linked to the spread of smartphones and tablets, as many users switched to mobile devices.

On average, a platform stays relevant for 8-10 years.
After that the platform fades away or its next generation comes out.
One console appears to have stayed relevant for an abnormally long time: the data give the Nintendo DS a span of 28 years. Since the DS was released in 2004, this figure most likely reflects a mis-dated record rather than a real lifetime.

To build the forecast for 2017, we took data from the industry's decline period, 2009-2015.

It is noticeable that sales per game grow as a platform family evolves; game production costs probably grow as well.

Most of the games were released in the Action, Miscellaneous and Sports genres.

User preferences for gaming platforms and genres can differ between markets. For example, as already mentioned, Japanese players most prefer Nintendo consoles and RPGs.
In general, the Japanese market is very different from the others, mainly because it has been developing since the 80s and has its own distinct gaming culture.