# Step 1. Open the data file and study the general information

## Project description
We work for the online store Ice, which sells video games all over the world.
We need to identify patterns that determine whether a game succeeds or not. This will allow us to spot potential big winners and plan advertising campaigns.


## Import

In [1290]:
import pandas as pd
import numpy as np
import chart_studio.plotly as py
import seaborn as sns
import plotly.express as px
from scipy import stats as st
import matplotlib.pyplot as plt
from pandas_profiling import ProfileReport
import sidetable

%matplotlib inline

## Load data

In [1291]:
try:
    df_games_raw = pd.read_csv('games.csv')
except FileNotFoundError:
    # fall back to the alternative dataset location
    df_games_raw = pd.read_csv('/datasets/games.csv')

## Explore initial data

In [1292]:
print('General info about the data')
print(df_games_raw.info())
print()

print('Five first rows')
print(df_games_raw.head())
print()

print('Description of the numerical columns')
print(df_games_raw.describe())
print()

print('Description of the textual columns')
print(df_games_raw.describe(include=object))
print()

General info about the data
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16715 entries, 0 to 16714
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Name             16713 non-null  object 
 1   Platform         16715 non-null  object 
 2   Year_of_Release  16446 non-null  float64
 3   Genre            16713 non-null  object 
 4   NA_sales         16715 non-null  float64
 5   EU_sales         16715 non-null  float64
 6   JP_sales         16715 non-null  float64
 7   Other_sales      16715 non-null  float64
 8   Critic_Score     8137 non-null   float64
 9   User_Score       10014 non-null  object 
 10  Rating           9949 non-null   object 
dtypes: float64(6), object(5)
memory usage: 1.4+ MB
None

Five first rows
                       Name Platform  Year_of_Release         Genre  NA_sales  \
0                Wii Sports      Wii           2006.0        Sports     41.36   
1         Super Mario Bros.  

In [1293]:
df_games_raw.stb.missing(style=True)

Unnamed: 0,missing,total,percent
Critic_Score,8578,16715,51.32%
Rating,6766,16715,40.48%
User_Score,6701,16715,40.09%
Year_of_Release,269,16715,1.61%
Name,2,16715,0.01%
Genre,2,16715,0.01%
Platform,0,16715,0.00%
NA_sales,0,16715,0.00%
EU_sales,0,16715,0.00%
JP_sales,0,16715,0.00%


### Notes on explore initial data

The data contains the following columns:
- Name
- Platform
- Year_of_Release
- Genre
- NA_sales (North American sales in USD million)
- EU_sales (sales in Europe in USD million)
- JP_sales (sales in Japan in USD million)
- Other_sales (sales in other countries in USD million)
- Critic_Score (maximum of 100)
- User_Score (maximum of 10)
- Rating (ESRB)

The data contains 16,715 entries. Only the Platform column and the four sales columns (NA, EU, JP and Other) have a value in every row.
Name and Genre each have 2 missing values.
Year_of_Release has 269 missing values.
Critic_Score has 8,578 missing values, which is more than 50% of the rows.
Rating has 6,766 missing values and User_Score has 6,701.

From the description of the numerical columns we can see that the sales columns contain a large number of 0 values. We know that because the first quartile of these columns is 0, so at least a quarter of the values are 0.

From the description of the textual columns we see that the User_Score column mixes numerical scores with the text value "tbd" (to be determined).

The Year_of_Release column should be of integer type.
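The quartile argument above can be checked directly by counting zeros. The following is a minimal sketch on a small hypothetical frame (the `toy` name and its values are made up for illustration and are not taken from games.csv); in the notebook the same expressions could be applied to the sales columns of `df_games_raw`.

```python
import pandas as pd

# Hypothetical toy data, not values from games.csv.
toy = pd.DataFrame({
    'NA_sales': [41.36, 0.0, 0.0, 0.12],
    'JP_sales': [3.77, 0.0, 0.03, 0.0],
})

# A boolean frame averaged column-wise gives the share of exact zeros.
zero_share = (toy == 0).mean()
print(zero_share)

# If the first quartile is 0, at least 25% of the values are 0
# (sales cannot be negative).
first_quartile = toy.quantile(0.25)
print(first_quartile)
```

Reading the share of zeros directly is more informative than the quartile alone, since the quartile only gives a lower bound.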






# Step 2. Prepare the data

## Replace the column names (make them lowercase)

In [1294]:
# rename Pandas columns to lower case 
df_games_raw.columns= df_games_raw.columns.str.lower()
df_games_raw.columns

Index(['name', 'platform', 'year_of_release', 'genre', 'na_sales', 'eu_sales',
       'jp_sales', 'other_sales', 'critic_score', 'user_score', 'rating'],
      dtype='object')

done

## Why are there missing values?

While scrolling through the data we noticed that many of the rows where one or all of these columns are missing belong to games released long ago, in the previous millennium. We will split the data into entries before 2000 and entries from 2000 onward and check whether the rate of missing values was higher before 2000.

In [1295]:
df_games_before_2000 =  df_games_raw.query('year_of_release < 2000')
df_games_after_2000 =  df_games_raw.query('year_of_release >= 2000')

In [1296]:
df_games_raw.stb.missing(style=True)

Unnamed: 0,missing,total,percent
critic_score,8578,16715,51.32%
rating,6766,16715,40.48%
user_score,6701,16715,40.09%
year_of_release,269,16715,1.61%
name,2,16715,0.01%
genre,2,16715,0.01%
platform,0,16715,0.00%
na_sales,0,16715,0.00%
eu_sales,0,16715,0.00%
jp_sales,0,16715,0.00%


In [1297]:
df_games_before_2000.stb.missing(style=True)

Unnamed: 0,missing,total,percent
critic_score,1880,1976,95.14%
user_score,1875,1976,94.89%
rating,1871,1976,94.69%
name,2,1976,0.10%
genre,2,1976,0.10%
platform,0,1976,0.00%
year_of_release,0,1976,0.00%
na_sales,0,1976,0.00%
eu_sales,0,1976,0.00%
jp_sales,0,1976,0.00%


In [1298]:
df_games_after_2000.stb.missing(style=True)

Unnamed: 0,missing,total,percent
critic_score,6583,14470,45.49%
rating,4807,14470,33.22%
user_score,4732,14470,32.70%
name,0,14470,0.00%
platform,0,14470,0.00%
year_of_release,0,14470,0.00%
genre,0,14470,0.00%
na_sales,0,14470,0.00%
eu_sales,0,14470,0.00%
jp_sales,0,14470,0.00%


We see that before the year 2000 about 95% of the values in critic_score, user_score and rating are missing, compared with roughly 33% to 45% from 2000 onward.

Let's see in each year how many values we have

In [1299]:
df_games_raw.groupby(
    by='year_of_release'
).count()[['name', 'rating', 'critic_score', 'user_score']]

Unnamed: 0_level_0,name,rating,critic_score,user_score
year_of_release,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1980.0,9,0,0,0
1981.0,46,0,0,0
1982.0,36,0,0,0
1983.0,17,0,0,0
1984.0,14,0,0,0
1985.0,14,1,1,1
1986.0,21,0,0,0
1987.0,16,0,0,0
1988.0,15,1,1,1
1989.0,17,0,0,0


We see that for all 3 columns the number of entries drops sharply before 2000, and that before 1994 there are almost no entries in these fields.

A possible reason is the lack of documentation from so long ago.
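The per-year counts above can also be turned into a share of missing values per year, which is easier to compare across years with very different numbers of releases. This is only a sketch on hypothetical data (the `toy` frame and its values are invented); the column names follow the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with made-up values.
toy = pd.DataFrame({
    'year_of_release': [1985, 1985, 2005, 2005, 2005, 2005],
    'critic_score': [np.nan, np.nan, 80.0, np.nan, 65.0, 70.0],
    'rating': [np.nan, 'E', 'E', np.nan, 'T', 'M'],
})

# isna() yields booleans; the mean of booleans is the share of missing values.
missing_share_by_year = (
    toy[['critic_score', 'rating']]
    .isna()
    .groupby(toy['year_of_release'])
    .mean()
)
print(missing_share_by_year)
```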

## Treating missing values

### name

In [1300]:
df_games_raw[df_games_raw['name'].isnull()]

Unnamed: 0,name,platform,year_of_release,genre,na_sales,eu_sales,jp_sales,other_sales,critic_score,user_score,rating
659,,GEN,1993.0,,1.78,0.53,0.0,0.08,,,
14244,,GEN,1993.0,,0.0,0.0,0.03,0.0,,,


Looks like someone forgot to add these names. We will fill them with 'unknown'

In [1301]:
# work on a copy so that the raw data stays untouched
df_games = df_games_raw.copy()
df_games['name'] = df_games['name'].fillna('unknown')
df_games[df_games['name'].isnull()]

Unnamed: 0,name,platform,year_of_release,genre,na_sales,eu_sales,jp_sales,other_sales,critic_score,user_score,rating


done

### genre

In [1302]:
df_games[df_games['genre'].isnull()]

Unnamed: 0,name,platform,year_of_release,genre,na_sales,eu_sales,jp_sales,other_sales,critic_score,user_score,rating
659,unknown,GEN,1993.0,,1.78,0.53,0.0,0.08,,,
14244,unknown,GEN,1993.0,,0.0,0.0,0.03,0.0,,,


These are the same two rows that were missing the name, and most of their other columns are empty as well. We will remove these rows since there is nothing to learn from them.

In [1303]:
df_games.dropna(subset=['genre'] ,inplace=True)
df_games[df_games['genre'].isnull()]

Unnamed: 0,name,platform,year_of_release,genre,na_sales,eu_sales,jp_sales,other_sales,critic_score,user_score,rating


done

### year_of_release

Add new column for duplicate game name

In [1304]:
df_games['duplicate_name'] = df_games.duplicated(
    subset='name', keep=False
)

Check if we can restore rating and year of release by the names for games with duplicates 

In [1305]:
df_games.query('duplicate_name').sort_values(by='name').iloc[1000:1020]

Unnamed: 0,name,platform,year_of_release,genre,na_sales,eu_sales,jp_sales,other_sales,critic_score,user_score,rating,duplicate_name
14970,Captain Morgane and the Golden Turtle,PC,2012.0,Adventure,0.0,0.02,0.0,0.0,54.0,6.2,,True
16359,Captain Morgane and the Golden Turtle,Wii,2012.0,Adventure,0.0,0.01,0.0,0.0,,tbd,,True
15225,Captain Morgane and the Golden Turtle,PS3,2012.0,Adventure,0.0,0.02,0.0,0.0,,tbd,,True
16530,Carmageddon: Max Damage,PS4,2016.0,Action,0.01,0.0,0.0,0.0,51.0,5.5,M,True
15456,Carmageddon: Max Damage,XOne,2016.0,Action,0.01,0.01,0.0,0.0,52.0,7.1,M,True
15050,Carmen Sandiego: The Secret of the Stolen Drums,XB,2004.0,Action,0.02,0.01,0.0,0.0,53.0,tbd,E,True
12851,Carmen Sandiego: The Secret of the Stolen Drums,PS2,2004.0,Action,0.03,0.02,0.0,0.01,53.0,tbd,E,True
15310,Carmen Sandiego: The Secret of the Stolen Drums,GC,2004.0,Action,0.02,0.0,0.0,0.0,57.0,tbd,E,True
840,Carnival Games,DS,2008.0,Misc,1.21,0.63,0.0,0.19,48.0,3.3,E,True
294,Carnival Games,Wii,2007.0,Misc,2.12,1.47,0.05,0.42,56.0,6,E,True


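We see that for games released on multiple platforms the rating, and in most cases the year_of_release, is the same for every occurrence (Carnival Games, released in 2007 on Wii and in 2008 on DS, is an exception). So we can use the rows where the information exists to fill in the missing values.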

In [1306]:
df_games.stb.missing(style=True)

Unnamed: 0,missing,total,percent
critic_score,8576,16713,51.31%
rating,6764,16713,40.47%
user_score,6699,16713,40.08%
year_of_release,269,16713,1.61%
name,0,16713,0.00%
platform,0,16713,0.00%
genre,0,16713,0.00%
na_sales,0,16713,0.00%
eu_sales,0,16713,0.00%
jp_sales,0,16713,0.00%


In [1307]:
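# build a name -> year lookup from the rows that have no missing values at all
df_complete = df_games.dropna()
dict_of_name_and_year = dict(zip(df_complete['name'], df_complete['year_of_release']))
df_games['year_of_release'] = df_games['year_of_release'].fillna(df_games['name'].map(dict_of_name_and_year))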
df_games.stb.missing(style=True)

Unnamed: 0,missing,total,percent
critic_score,8576,16713,51.31%
rating,6764,16713,40.47%
user_score,6699,16713,40.08%
year_of_release,167,16713,1.00%
name,0,16713,0.00%
platform,0,16713,0.00%
genre,0,16713,0.00%
na_sales,0,16713,0.00%
eu_sales,0,16713,0.00%
jp_sales,0,16713,0.00%


We managed to fill 102 cells.
The remaining missing years will be set to 0 so that they do not interfere with changing the column type.
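The name-based fill can also be written with `groupby` and `transform`. The sketch below is an alternative formulation, not the notebook's method: it takes the first known year among all rows with the same name, whereas the dictionary above is built only from rows that are complete in every column, so the two approaches may fill a different number of cells. The `toy` frame and the `year_filled` column are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with made-up games.
toy = pd.DataFrame({
    'name': ['Game A', 'Game A', 'Game B', 'Game C'],
    'platform': ['PS2', 'XB', 'PS2', 'Wii'],
    'year_of_release': [2004.0, np.nan, np.nan, 2008.0],
})

# 'first' returns the first non-null value within each group of equal names,
# and transform broadcasts it back to every row of that group.
first_known_year = toy.groupby('name')['year_of_release'].transform('first')
toy['year_filled'] = toy['year_of_release'].fillna(first_known_year)
print(toy)
```

A game whose year is missing on every platform (Game B here) stays missing, which matches the situation of the cells the notebook could not restore.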

### rating

In [1308]:
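# build a name -> rating lookup from the rows that have no missing values at all
df_complete = df_games.dropna()
dict_of_name_and_rating = dict(zip(df_complete['name'], df_complete['rating']))
df_games['rating'] = df_games['rating'].fillna(df_games['name'].map(dict_of_name_and_rating))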
df_games.stb.missing(style=True)

Unnamed: 0,missing,total,percent
critic_score,8576,16713,51.31%
user_score,6699,16713,40.08%
rating,6403,16713,38.31%
year_of_release,167,16713,1.00%
name,0,16713,0.00%
platform,0,16713,0.00%
genre,0,16713,0.00%
na_sales,0,16713,0.00%
eu_sales,0,16713,0.00%
jp_sales,0,16713,0.00%


We managed to fill 361 cells. The rest we cannot fill, so we will put 'unknown'.

In [1309]:
df_games.loc[df_games['rating'].isna(), 'rating'] = 'unknown'

In [1310]:
df_games.stb.missing(style=True)

Unnamed: 0,missing,total,percent
critic_score,8576,16713,51.31%
user_score,6699,16713,40.08%
year_of_release,167,16713,1.00%
name,0,16713,0.00%
platform,0,16713,0.00%
genre,0,16713,0.00%
na_sales,0,16713,0.00%
eu_sales,0,16713,0.00%
jp_sales,0,16713,0.00%
other_sales,0,16713,0.00%


Now we will replace the remaining NaN values in year_of_release with 0 to help us in the later analysis, and change the type to int.

In [1311]:
df_games['year_of_release'] = df_games['year_of_release'].fillna(0).astype(int)
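An alternative to the placeholder 0 is pandas' nullable integer dtype, which keeps missing years as `<NA>` while still storing whole numbers. This is a sketch of that option on a hypothetical Series, not what the notebook does; the rest of the notebook relies on the `year_of_release != 0` filter.

```python
import numpy as np
import pandas as pd

# Hypothetical years, one of them missing.
years = pd.Series([2006.0, np.nan, 1993.0], name='year_of_release')

# 'Int64' (capital I) is the nullable integer dtype: NaN becomes <NA>.
years_nullable = years.astype('Int64')
print(years_nullable)

# Unknown years are then filtered with notna() instead of a "!= 0" check.
known = years_nullable[years_nullable.notna()]
print(known.tolist())
```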

### tbd in user_score

In [1312]:
number = df_games.query('user_score =="tbd"').shape[0]
f'We have {number} rows with tbd.'

'We have 2424 rows with tbd.'

TBD means 'to be determined'. As far as we are concerned, this is equivalent to a missing value, so we will replace both tbd and NaN with 'unknown'.

In [1313]:
df_games.loc[(df_games['user_score'] == "tbd") |  (df_games['user_score'].isna()) , 'user_score'] = 'unknown'
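Another way to handle 'tbd', shown here only as a sketch and not as the notebook's approach, is to convert the column to numbers and let every non-numeric entry become NaN. The column then stays numeric, so means and correlations work without filtering out 'unknown' strings first. The `scores` Series below is hypothetical.

```python
import pandas as pd

# Hypothetical user scores stored as text, including 'tbd' and a missing value.
scores = pd.Series(['8.0', 'tbd', None, '6.2'], name='user_score')

# errors='coerce' turns anything that cannot be parsed as a number into NaN.
scores_numeric = pd.to_numeric(scores, errors='coerce')
print(scores_numeric)

# Numeric aggregations skip NaN by default.
print(scores_numeric.mean())
```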

In [1314]:
df_games.stb.missing(style=True)

Unnamed: 0,missing,total,percent
critic_score,8576,16713,51.31%
name,0,16713,0.00%
platform,0,16713,0.00%
year_of_release,0,16713,0.00%
genre,0,16713,0.00%
na_sales,0,16713,0.00%
eu_sales,0,16713,0.00%
jp_sales,0,16713,0.00%
other_sales,0,16713,0.00%
user_score,0,16713,0.00%


### critic_score

We can't infer a missing critic score from the game's scores on other platforms, because the score is affected by the platform. We will replace the missing values with 'unknown'.

In [1315]:
df_games.loc[df_games['critic_score'].isna(), 'critic_score'] = 'unknown'
df_games.stb.missing(style=True)

Unnamed: 0,missing,total,percent
name,0,16713,0.00%
platform,0,16713,0.00%
year_of_release,0,16713,0.00%
genre,0,16713,0.00%
na_sales,0,16713,0.00%
eu_sales,0,16713,0.00%
jp_sales,0,16713,0.00%
other_sales,0,16713,0.00%
critic_score,0,16713,0.00%
user_score,0,16713,0.00%


No more NaN's

## Calculate the total sales

In [1316]:
df_games['total_sales'] = df_games[['na_sales', 'eu_sales', 'jp_sales', 'other_sales']].sum(axis=1)
df_games.sample(5)

Unnamed: 0,name,platform,year_of_release,genre,na_sales,eu_sales,jp_sales,other_sales,critic_score,user_score,rating,duplicate_name,total_sales
7306,NBA,PSP,2005,Sports,0.2,0.0,0.0,0.02,57.0,6.9,E,False,0.22
13086,FIFA Soccer 2004,GBA,2003,Sports,0.04,0.01,0.0,0.0,82.0,7.9,E,True,0.05
2728,Shark Tale,PS2,2004,Action,0.37,0.29,0.0,0.1,69.0,6.5,E,True,0.76
15056,Atari Flashback Classics: Volume 2,PS4,2016,Misc,0.02,0.0,0.0,0.0,unknown,unknown,E,True,0.02
5529,Samurai Warriors 2: Xtreme Legends (JP sales),PS2,2007,Action,0.0,0.0,0.33,0.0,unknown,unknown,unknown,False,0.33


done

# Step 3. Analyze the data

## How many games were released in different years? Is the data for every period significant?

For this chart we exclude the rows where year_of_release is 0 (unknown year).

In [1317]:
df_games_no_0 = df_games.query('year_of_release != 0')

In [1318]:

year_game = df_games_no_0.groupby(
    by='year_of_release'
).count()['name'].to_frame().reset_index()

print(year_game)
fig = px.bar(year_game, x='year_of_release', y='name')
fig.update_layout(
    title="Number Of Games Per Year",
    yaxis_title="Number Of Games", 
)
fig.show()

    year_of_release  name
0              1980     9
1              1981    46
2              1982    36
3              1983    17
4              1984    14
5              1985    14
6              1986    21
7              1987    16
8              1988    15
9              1989    17
10             1990    16
11             1991    41
12             1992    43
13             1993    60
14             1994   121
15             1995   219
16             1996   263
17             1997   289
18             1998   379
19             1999   338
20             2000   350
21             2001   484
22             2002   845
23             2003   784
24             2004   764
25             2005   948
26             2006  1019
27             2007  1201
28             2008  1440
29             2009  1430
30             2010  1266
31             2011  1144
32             2012   661
33             2013   547
34             2014   581
35             2015   606
36             2016   502


We can see that far fewer games were released before 2000 than after it. Moreover, before 1990 there were very few releases (fewer than 50 per year).

## Variation of sales by platform type

### How did sales vary from platform to platform?

In [1319]:
platform_sales = df_games.pivot_table(
    values='total_sales',
    index='platform',
    aggfunc='sum'
).sort_values(ascending=False, by='total_sales').reset_index()
print(platform_sales)

fig = px.bar(platform_sales, x='platform', y='total_sales')
fig.update_layout(
    title="Total Sales Per Platform",
    yaxis_title="Total Sales (USD million)", 
)
fig.show()

   platform  total_sales
0       PS2      1255.77
1      X360       971.42
2       PS3       939.65
3       Wii       907.51
4        DS       806.12
5        PS       730.86
6       GBA       317.85
7       PS4       314.14
8       PSP       294.05
9        PC       259.52
10      3DS       259.00
11       XB       257.74
12       GB       255.46
13      NES       251.05
14      N64       218.68
15     SNES       200.04
16       GC       198.93
17     XOne       159.32
18     2600        96.98
19     WiiU        82.19
20      PSV        54.07
21      SAT        33.59
22      GEN        28.35
23       DC        15.95
24      SCD         1.86
25       NG         1.44
26       WS         1.42
27     TG16         0.16
28      3DO         0.10
29       GG         0.04
30     PCFX         0.03


The platform with the largest total sales is PS2 (PlayStation 2 by Sony).

### Distribution of PS2 games by year of release

In [1320]:
df_games_PS2 = df_games_no_0.query('platform == "PS2"')

year_PS2 = df_games_PS2.groupby(
    by='year_of_release'
).count()['name'].to_frame().reset_index()

print(year_PS2)
fig = px.bar(year_PS2, x='year_of_release', y='name')
fig.update_layout(
    title="Number Of PS2 Games Per Year",
    yaxis_title="Number Of Games", 
)
fig.show()

    year_of_release  name
0              2000    82
1              2001   185
2              2002   285
3              2003   258
4              2004   259
5              2005   261
6              2006   262
7              2007   215
8              2008   191
9              2009    96
10             2010    38
11             2011     7


### Find platforms that used to be popular but now have zero sales. How long does it generally take for new platforms to appear and old ones to fade?

In [1321]:
df_games_console_pivot = df_games_no_0.pivot_table(
    index=['platform', 'year_of_release'],
    values='total_sales',
    aggfunc='sum'
).reset_index()
df_games_console_pivot

Unnamed: 0,platform,year_of_release,total_sales
0,2600,1980,11.38
1,2600,1981,35.68
2,2600,1982,28.88
3,2600,1983,5.84
4,2600,1984,0.27
...,...,...,...
236,XB,2008,0.18
237,XOne,2013,18.96
238,XOne,2014,54.07
239,XOne,2015,60.14


In [1322]:
df_games_console_pivot['year_of_release'] = df_games_console_pivot['year_of_release'].astype(str)
fig = px.bar(df_games_console_pivot, x='platform', y='total_sales', barmode='group', color='year_of_release')
fig.update_traces(showlegend=False)
fig.show()

A sample of consoles that used to be popular but now have zero sales:

In [1323]:
zero_sales_now = ['PSP', 'PS', 'DS', 'N64']

In [1324]:
df_games_zero_sales_now = df_games_no_0.query('platform in @zero_sales_now')
df_games_zero_sales_now_pivot = df_games_zero_sales_now.pivot_table(
    index=['platform', 'year_of_release'],
    values='total_sales',
    aggfunc='sum'
).reset_index()
print(df_games_zero_sales_now_pivot)

   platform  year_of_release  total_sales
0        DS             1985         0.02
1        DS             2004        17.27
2        DS             2005       130.14
3        DS             2006       119.81
4        DS             2007       147.23
5        DS             2008       145.36
6        DS             2009       119.56
7        DS             2010        85.35
8        DS             2011        26.23
9        DS             2012        11.67
10       DS             2013         1.54
11      N64             1996        34.10
12      N64             1997        39.50
13      N64             1998        49.24
14      N64             1999        57.87
15      N64             2000        33.97
16      N64             2001         3.25
17      N64             2002         0.08
18      N64             2004         0.33
19       PS             1994         6.03
20       PS             1995        35.96
21       PS             1996        94.70
22       PS             1997      

In [1325]:
df_games_zero_sales_now_pivot['year_of_release'] = df_games_zero_sales_now_pivot['year_of_release'].astype(str)
fig = px.bar(df_games_zero_sales_now_pivot, x='platform', y='total_sales', barmode='group', color='year_of_release')
fig.update_traces(showlegend=False)
fig.show()

We can see in the figure that consoles reach a popularity peak and then fade. It takes a little less than a decade for the glory of a console to fade.

We will also take the 4 best-selling consoles for this task. Here we will answer: which platforms are leading in sales? Which ones are growing or shrinking?

In [1326]:
most_sale_console = platform_sales.nlargest(4, 'total_sales')['platform'].tolist()
most_sale_console

['PS2', 'X360', 'PS3', 'Wii']

In [None]:
df_games_most_sale_console = df_games_no_0.query('platform in @most_sale_console')
df_games_most_sale_console_pivot = df_games_most_sale_console.pivot_table(
    index=['platform', 'year_of_release'],
    values='total_sales',
    aggfunc='sum'
).reset_index()
df_games_most_sale_console_pivot


In [None]:
df_games_most_sale_console_pivot['year_of_release'] = df_games_most_sale_console_pivot['year_of_release'].astype(str)
fig = px.bar(df_games_most_sale_console_pivot, x='platform', y='total_sales', barmode='group', color='year_of_release')
fig.update_traces(showlegend=False)
fig.show()

We can see in the figure that these consoles also reach a popularity peak and then fade. It takes approximately a decade for the glory of a console to fade.
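The "about a decade" estimate can be backed by a number instead of being read off the chart. The sketch below measures, for each platform, the span between the first and last year with recorded sales. The `toy` frame, the platform labels and the `years_active` column are hypothetical; in the notebook the same aggregation could be run on `df_games_console_pivot` before its year column is converted to strings.

```python
import pandas as pd

# Hypothetical yearly totals with made-up platforms and numbers.
toy = pd.DataFrame({
    'platform': ['AAA', 'AAA', 'AAA', 'BBB', 'BBB'],
    'year_of_release': [2000, 2004, 2009, 2006, 2011],
    'total_sales': [5.0, 40.0, 1.0, 10.0, 2.0],
})

# First and last year with sales for each platform.
lifespan = toy.groupby('platform')['year_of_release'].agg(['min', 'max'])
lifespan['years_active'] = lifespan['max'] - lifespan['min'] + 1
print(lifespan)

# A typical lifespan across platforms.
print(lifespan['years_active'].median())
```

Isolated mis-dated rows, such as the DS entry for 1985 in the table above, would stretch the span, so such rows should be inspected before trusting the result.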

## Determine what period we should take data for

From the previous question, where we examined the 4 best-selling platforms, we see that they have no sales before 2000. We also saw in the earlier chart that far fewer games were released before 2000.

## Working data

From now on we will work only with data from 2000 onward.

In [None]:
df_games_after_2000 =  df_games.query('year_of_release >= 2000')

## Box plot for sales  broken down by platform

In [None]:

fig = px.box(df_games_after_2000, x="platform", y="total_sales")
fig.show()

Very difficult to read. We will remove results of total sales larger than 10

In [None]:
fig = px.box(df_games_after_2000.query('total_sales < 10'), x="platform", y="total_sales")
fig.show()

Still difficult to read. We will keep only games with total sales below 2.

In [None]:
fig = px.box(df_games_after_2000.query('total_sales < 2'), x="platform", y="total_sales")
fig.show()

We can see that the differences in sales are substantial in some cases. Take for example X360 compared to PSV: the median for X360 is higher than the third quartile (Q3) of PSV.
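The comparison between a median and a third quartile can be read as numbers rather than from the box plot. The following is a sketch on hypothetical sales figures (the values are invented and only the platform labels come from the notebook); applied to `df_games_after_2000` it would give the actual quartiles.

```python
import pandas as pd

# Hypothetical total sales for two platforms (made-up values).
toy = pd.DataFrame({
    'platform': ['X360'] * 4 + ['PSV'] * 4,
    'total_sales': [0.2, 0.4, 0.6, 0.8, 0.05, 0.1, 0.15, 0.2],
})

# One row per platform, one column per requested quantile.
quartiles = (
    toy.groupby('platform')['total_sales']
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(quartiles)

# The statement "median of X360 is above Q3 of PSV" as a boolean check.
print(quartiles.loc['X360', 0.5] > quartiles.loc['PSV', 0.75])
```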

In [None]:
df_games_after_2000_pivot = df_games_after_2000.pivot_table(
    values='total_sales',
    index='platform',
    aggfunc='mean'
).sort_values(by='total_sales', ascending=True).reset_index()

fig = px.bar(df_games_after_2000_pivot, x='platform', y='total_sales')
fig.update_layout(
    title='Average Total Sales Per Platform',
    yaxis_title="Average Total Sales", 
)
fig.show()

In [None]:
# we will use this later
most_sale_console_average = df_games_after_2000_pivot.tail(5)['platform'].tolist()

most_sale_console_average

There is big difference in the Average Total Sales Per Platform. 

## How user and professional reviews affect sales for one popular platform

For this task we will choose PS4, the platform with the highest average sales.

We will build scatter plots and calculate the correlation between reviews and sales.

In [None]:
df_games_after_2000_PS4 = df_games_after_2000.query('platform == "PS4"')
df_games_after_2000_PS4


In [None]:
# keep only the rows where both scores are known
df_games_after_2000_PS4_no_unknown = df_games_after_2000_PS4[
    (df_games_after_2000_PS4['critic_score'] != 'unknown')
    & (df_games_after_2000_PS4['user_score'] != 'unknown')
].copy()  # explicit copy avoids SettingWithCopyWarning on the assignments below

# the score columns hold mixed objects, so convert them to numbers
df_games_after_2000_PS4_no_unknown['critic_score'] = df_games_after_2000_PS4_no_unknown[
    'critic_score'].astype(int)
df_games_after_2000_PS4_no_unknown['user_score'] = df_games_after_2000_PS4_no_unknown[
    'user_score'].astype('float')

df_games_after_2000_PS4_no_unknown[['critic_score', 'user_score', 'total_sales']].corr()


In [None]:

fig = px.scatter(df_games_after_2000_PS4_no_unknown, x="critic_score", y="total_sales")
fig.show()

In [None]:
fig = px.scatter(df_games_after_2000_PS4_no_unknown, x="user_score", y="total_sales")
fig.show()

For user score we don't see a correlation, but for critic score we can spot a trend in the scatter plot, and the correlation is 0.4, a moderate correlation.
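A correlation coefficient on its own does not say whether the relationship could be due to chance. As a sketch, `scipy.stats.pearsonr` returns both the coefficient and a p-value; the `toy` data below is hypothetical and is not the PS4 subset.

```python
import pandas as pd
from scipy import stats as st

# Hypothetical scores and sales (made-up values).
toy = pd.DataFrame({
    'critic_score': [50, 60, 70, 80, 90],
    'total_sales': [0.1, 0.3, 0.2, 0.9, 1.5],
})

# Pearson correlation coefficient and its two-sided p-value.
r, p_value = st.pearsonr(toy['critic_score'], toy['total_sales'])
print(round(r, 2), round(p_value, 3))

# Series.corr uses Pearson by default, so it agrees with pearsonr.
r_pandas = toy['critic_score'].corr(toy['total_sales'])
print(round(r_pandas, 2))
```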

## Compare the sales of the same games on other platforms

We will create a function to automate this comparison. For the comparison we will take the 5 platforms with the highest average sales: most_sale_console_average

In [None]:
most_sale_console_average

In [None]:
# remove PS4 because we already examined it, and GB because its scores are all NaN
most_sale_console_average.remove('PS4')
most_sale_console_average.remove('GB')
most_sale_console_average

In [None]:
def score_effect(platform:str, score_type:str, df:pd.DataFrame):
    """Print the correlation between a score column and total_sales for one platform and plot it."""
    # keep only the rows of this platform that have a known score
    df = df.query('platform == @platform')
    df = df[df[score_type] != 'unknown'].copy()

    df[score_type] = df[score_type].astype(float)
    correlation = round(df[score_type].corr(df['total_sales']), 2)
    print(f'The correlation between {score_type} and total_sales is {correlation}')

    fig = px.scatter(df, x=score_type, y="total_sales", title=platform)
    fig.show()
    

In [None]:
for platform in most_sale_console_average:
    print(f'Info for {platform}')
    score_effect(platform=platform, score_type='critic_score', df=df_games_after_2000)
    score_effect(platform=platform, score_type='user_score', df=df_games_after_2000)
    print()

In all 3 compared platforms ('Wii', 'PS3', 'X360') the correlation between the critic score and total sales was higher than the correlation between the user score and total sales.

## General distribution of games by genre

In [None]:
df_games_after_2000.pivot_table(
    values='total_sales',
    index='genre',
    aggfunc='mean'
).sort_values(by='total_sales', ascending=False)


The genres with the highest average sales are those with more action, such as shooter, sports and racing, while the genres with the lowest average sales are calmer ones such as adventure, strategy and puzzle.

# Step 4. Create a user profile for each region

## The top five platforms

In [None]:
def most_sales_per_platform_for_region(df:pd.DataFrame, region:str):
    df = df.pivot_table(
    values=region,
    index='platform',
    aggfunc='mean'
    ).nlargest(5, region).sort_values(by=region, ascending=False)
    print(df)

In [None]:
regions = ['na_sales', 'eu_sales', 'jp_sales']

In [None]:
for region in regions:
    print(region)
    most_sales_per_platform_for_region(df=df_games_after_2000, region=region)
    print()

We can see that North America (NA) has the highest sales. The Japanese market is the smallest and is dominated by 3 platforms that do not appear in the top five of North America or Europe.
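Note that `aggfunc='mean'` ranks platforms by average sales per game, which can differ from a ranking by total sales or market share. The sketch below illustrates the difference on hypothetical data (the `toy` frame, the platform labels and the `market_share` column are invented); it is meant as a complementary view, not a replacement for the tables above.

```python
import pandas as pd

# Hypothetical regional sales: platform AAA has many small titles,
# platform BBB has a single large one.
toy = pd.DataFrame({
    'platform': ['AAA', 'AAA', 'AAA', 'BBB'],
    'na_sales': [1.0, 1.0, 1.0, 2.0],
})

by_platform = toy.groupby('platform')['na_sales'].agg(['sum', 'mean'])
# Share of the region's total sales that each platform accounts for.
by_platform['market_share'] = by_platform['sum'] / by_platform['sum'].sum()
print(by_platform)
```

Here AAA leads by total sales while BBB leads by average sales per game, so the choice of aggregation changes which platform appears on top.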

## The top five genres

In [None]:
def most_sales_per_genre_for_region(df:pd.DataFrame, region:str):
    df = df.pivot_table(
    values=region,
    index='genre',
    aggfunc='mean'
    ).nlargest(5, region).sort_values(by=region, ascending=False)
    print(df)

In [None]:
for region in regions:
    print(region)
    most_sales_per_genre_for_region(df=df_games_after_2000, region=region)
    print()

Here we see results similar to those for platforms. The North American market is the largest, and the genres in North America and Europe are very similar: 'Shooter', 'Platform', 'Sports' and 'Racing' appear in the top 5 of both, in almost the same order. The Japanese market, besides being much smaller, has a mostly different set of top genres.

## Do ESRB ratings affect sales in individual regions?

In [None]:
def most_sales_per_rating_for_region(df:pd.DataFrame, region:str):
    df = df.pivot_table(
    values=region,
    index='rating',
    aggfunc='mean'
    ).nlargest(5, region).sort_values(by=region, ascending=False)
    print(df)

In [None]:
for region in regions:
    print(region)
    most_sales_per_rating_for_region(df=df_games_after_2000, region=region)
    print()

- Early Childhood (EC) is the lowest rating. 
- Everyone (E) is the base rating. 
- Everyone 10+ (E10+) signifies games appropriate for kids 10 years and older. 
- Teen (T) is the next level up. 
- Mature (M) is the highest normal rating. 
- Adults Only (AO) is the ESRB's 18+ rating

In North America and Europe, Adults Only (AO) is the most dominant rating, followed by Mature (M). The Japanese market is more conservative: games with adult ratings such as M and AO, which indicate sex and violence, are almost nonexistent there.

# Step 5. Test the following hypotheses:

In [None]:
df_games_after_2000_no_unknown_user_score = df_games_after_2000
df_games_after_2000_no_unknown_user_score = df_games_after_2000_no_unknown_user_score[
    df_games_after_2000_no_unknown_user_score['user_score'] != 'unknown']

In [None]:
# getting a list of user rating
def get_list_of_user_score(df:pd.DataFrame, column:str, name:str):
    df = df.query(f'{column}  == "{name}"')
    user_score_list = df['user_score'].tolist()
    user_score_list = [float(i) for i in user_score_list]
    user_score_df = pd.DataFrame(user_score_list)
    return user_score_df


## Hypothesis: Average user ratings of the Xbox One and PC platforms are the same

In [None]:
# get_list_of_user_score(
#         df=df_games_after_2000_no_unknown_user_score,
#         column='platform',
#         name='XOne' 
#     )

In [None]:
alpha = 0.05  # critical statistical significance level
# if the p-value is less than alpha, we reject the hypothesis

results = st.ttest_ind(
    get_list_of_user_score(
        df=df_games_after_2000_no_unknown_user_score,
        column='platform',
        name='XOne' 
    ), 
    get_list_of_user_score(
        df=df_games_after_2000_no_unknown_user_score,
        column='platform',
        name='PC' 
    )
)

print('p-value: ', results.pvalue)

if results.pvalue < alpha:
    print("We reject the null hypothesis")
else:
    print("We can't reject the null hypothesis") 

The test for equality of the two population means rejects the null hypothesis at the 5% significance level, so we conclude that the average user scores of the two platforms are not the same.

Let's look at the histograms of user_score for the two platforms.

In [None]:
fig = px.histogram(get_list_of_user_score(
        df=df_games_after_2000_no_unknown_user_score,
        column='platform',
        name='XOne' 
    ), 
    title='Xbox One User Score')
fig.show()

In [None]:
fig = px.histogram(get_list_of_user_score(
        df=df_games_after_2000_no_unknown_user_score,
        column='platform',
        name='PC' 
    ), 
    title='PC User Score')
fig.show()

Both histograms are skewed to the left but differ in the shape of the peak: in one the peak is higher and sharper, while in the other it is wider.

In [None]:
print('Description of the Xbox One User Score')
print(get_list_of_user_score(
        df=df_games_after_2000_no_unknown_user_score,
        column='platform',
        name='XOne' 
    ).describe())

print()

print('Description of the PC User Score')
print(get_list_of_user_score(
        df=df_games_after_2000_no_unknown_user_score,
        column='platform',
        name='PC' 
    ).describe())

The mean PC user score is higher, while the Xbox One has a higher median and a lower standard deviation.

## Hypothesis: Average user ratings for the Action and Sports genres are different

In [None]:
alpha = 0.05  # critical statistical significance level
# if the p-value is less than alpha, we reject the hypothesis

results = st.ttest_ind(
    get_list_of_user_score(
        df=df_games_after_2000_no_unknown_user_score,
        column='genre',
        name='Action' 
    ), 
    get_list_of_user_score(
        df=df_games_after_2000_no_unknown_user_score,
        column='genre',
        name='Sports' 
    )
)

print('p-value: ', results.pvalue)

if results.pvalue < alpha:
    print("We reject the null hypothesis")
else:
    print("We can't reject the null hypothesis")

Here we wanted to know whether the two samples differ. The null hypothesis of the test is that the average user ratings of the two genres are equal. Since the p-value is larger than the alpha we set (it is higher than 10%), we can't reject the null hypothesis, so the data does not support the claim that the ratings are different.
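The decision rule can be illustrated on one-dimensional samples. The sketch below uses randomly generated scores (hypothetical data, not the notebook's) and passes `equal_var=False`, which selects Welch's t-test and does not assume equal variances in the two groups. That is an optional variation, since the cells above use the default equal-variance test.

```python
import numpy as np
from scipy import stats as st

# Hypothetical user scores drawn at random; the seed only makes the run repeatable.
rng = np.random.default_rng(0)
sample_a = rng.normal(loc=7.0, scale=1.0, size=200)
sample_b = rng.normal(loc=7.0, scale=2.0, size=150)

alpha = 0.05  # significance level

# Null hypothesis: the two population means are equal.
# equal_var=False selects Welch's t-test.
results = st.ttest_ind(sample_a, sample_b, equal_var=False)

if results.pvalue < alpha:
    decision = 'reject the null hypothesis of equal means'
else:
    decision = "can't reject the null hypothesis of equal means"

print('p-value:', results.pvalue)
print(decision)
```

Passing one-dimensional arrays (or Series) makes `results.pvalue` a single number rather than a one-element array.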

Let's look at the histograms of user_score for the two genres.

In [None]:
fig = px.histogram(get_list_of_user_score(
        df=df_games_after_2000_no_unknown_user_score,
        column='genre',
        name='Action' 
    ), 
    title='Action User Score')
fig.show()

In [None]:
fig = px.histogram(get_list_of_user_score(
        df=df_games_after_2000_no_unknown_user_score,
        column='genre',
        name='Sports' 
    ), 
    title='Sports User Score')
fig.show()

Both histograms look almost identical.

In [None]:
print('Description of the Action User Score')
print(get_list_of_user_score(
        df=df_games_after_2000_no_unknown_user_score,
        column='genre',
        name='Action' 
    ).describe())

print()

print('Description of the Sports User Score')
print(get_list_of_user_score(
        df=df_games_after_2000_no_unknown_user_score,
        column='genre',
        name='Sports' 
    ).describe())

The descriptive statistics are also almost identical. There is no significant difference between the user scores of the Action and Sports genres.

# Step 6. Write a general conclusion

## Intro
In this project we tried to answer questions that can help to plan the most efficient advertising campaigns. 
We received historical data on video games and we used it to plan the ad campaign of 2017.

## Preprocessing and data preparation
The data contained many missing values. Some we managed to restore by logic. For example, the rating and the year of release of a game should generally stay the same regardless of the platform it is released on. So where a game was released on multiple platforms and the year of release and/or rating was missing in some rows, we could fill in the missing data from the other rows. There were many TBD values in the user score; those were treated as missing values because the data gives no clue as to what score to fill in instead.

## Analyze the data
We answered some interesting questions about the data that helped us get more familiar with it.
We learned that after the year 2000 there was strong growth in the number of games released per year, and that before the year 1990 almost none were released. Recording of user scores, critic scores and game ratings also became more common after the year 2000.
We noticed that platforms have a prosperous period of supporting games, after which they fade into oblivion.
Some platforms sell many more games than others. The platforms with the highest average sales per game are: 'Wii', 'PS3', 'X360', 'PS4' and 'GB'.
Games from specific genres are far more popular than others. For example, shooter games are the most popular in overall sales.

## User profile
There is great similarity in the preferred genres between the North American (NA) market, which is the largest, and the European (EU) market, and a slight similarity in platform preferences.
The Japanese market is much smaller, and its customers seem to be more conservative, choosing games with less violence and sex and using platforms that are not common in the other regions.

## User score hypothesis
The two hypotheses we tested were:
1. Average user ratings of the Xbox One and PC platforms are the same.
2. Average user ratings for the Action and Sports genres are different.

We rejected the 1st hypothesis: we can't say that the average user score is the same on the two platforms. For the 2nd hypothesis, the user scores for Action and Sports games are almost identical, so we found no evidence that their average ratings differ.