- Recently, I received a lot of feedback from friends who have just changed, or are about to change, their careers toward Data Analysis, Data Science, and Machine Learning. They pointed to a lack of material that bridges the beginning of the analysis journey and the advanced techniques.

- They are looking for detailed yet beginner-friendly Exploratory Data Analysis examples that are not overly complicated (with different regression or normalization techniques, etc.) and that show them how to start and, most importantly, how to read descriptive statistics and graphs.

- After this feedback, I decided to make a series of EDAs on different datasets, without making things too complicated for people taking their first steps in the DS/ML journey.

### This notebook is part of the 9 Beginner Friendly EDAs. If these EDAs are helpful to anyone, I will be more than happy.




#### **INTRO**



- In this study, we are going to perform Exploratory Data Analysis (EDA) on the Top Games on Google Play Store dataset.
- The study aims to be beginner friendly and to explain each step along the way as fully as possible.
- The dataset contains the top 100 games in each game category on the Google Play Store, along with their ratings and other data such as price and number of installs.

- First, let's import the required libraries.
- We will use Plotly's interactive environment for visualization.

In [1]:
import pandas as pd
import numpy as np


import plotly 
import plotly.express as px
import plotly.graph_objs as go
import plotly.offline as py
from plotly.offline import iplot
from plotly.subplots import make_subplots
import plotly.figure_factory as ff

### Overview Stage

- Read the csv
- Look for basic information about the dataset

In [2]:
df= pd.read_csv('../input/top-play-store-games/android-games.csv')
df.head()

Unnamed: 0,rank,title,total ratings,installs,average rating,growth (30 days),growth (60 days),price,category,5 star ratings,4 star ratings,3 star ratings,2 star ratings,1 star ratings,paid
0,1,Garena Free Fire- World Series,86273129,500.0 M,4,2.1,6.9,0.0,GAME ACTION,63546766,4949507,3158756,2122183,12495915,False
1,2,PUBG MOBILE - Traverse,37276732,500.0 M,4,1.8,3.6,0.0,GAME ACTION,28339753,2164478,1253185,809821,4709492,False
2,3,Mobile Legends: Bang Bang,26663595,100.0 M,4,1.5,3.2,0.0,GAME ACTION,18777988,1812094,1050600,713912,4308998,False
3,4,Brawl Stars,17971552,100.0 M,4,1.4,4.4,0.0,GAME ACTION,13018610,1552950,774012,406184,2219794,False
4,5,Sniper 3D: Fun Free Online FPS Shooting Game,14464235,500.0 M,4,0.8,1.5,0.0,GAME ACTION,9827328,2124154,1047741,380670,1084340,False


In [3]:
df.shape

(1730, 15)

- We have 1730 games and 15 different variables to work on.

In [4]:
df.isnull().sum()

rank                0
title               0
total ratings       0
installs            0
average rating      0
growth (30 days)    0
growth (60 days)    0
price               0
category            0
5 star ratings      0
4 star ratings      0
3 star ratings      0
2 star ratings      0
1 star ratings      0
paid                0
dtype: int64

- We have a very clean dataset, which is very rare in the real world.
- A dataset without missing values is like having a unicorn in your backyard, or Ronaldo playing for your favorite local team :)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1730 entries, 0 to 1729
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   rank              1730 non-null   int64  
 1   title             1730 non-null   object 
 2   total ratings     1730 non-null   int64  
 3   installs          1730 non-null   object 
 4   average rating    1730 non-null   int64  
 5   growth (30 days)  1730 non-null   float64
 6   growth (60 days)  1730 non-null   float64
 7   price             1730 non-null   float64
 8   category          1730 non-null   object 
 9   5 star ratings    1730 non-null   int64  
 10  4 star ratings    1730 non-null   int64  
 11  3 star ratings    1730 non-null   int64  
 12  2 star ratings    1730 non-null   int64  
 13  1 star ratings    1730 non-null   int64  
 14  paid              1730 non-null   bool   
dtypes: bool(1), float64(3), int64(8), object(3)
memory usage: 191.0+ KB


- The 'installs' column contains numbers and should be an integer or float, but it is stored as an object data type. It is good to take note of this.


In [6]:
df.describe()

Unnamed: 0,rank,total ratings,average rating,growth (30 days),growth (60 days),price,5 star ratings,4 star ratings,3 star ratings,2 star ratings,1 star ratings
count,1730.0,1730.0,1730.0,1730.0,1730.0,1730.0,1730.0,1730.0,1730.0,1730.0,1730.0
mean,50.386705,1064332.0,3.908092,321.735896,122.554971,0.010942,762231.5,116436.6,57063.07,27103.36,101495.0
std,28.936742,3429250.0,0.290973,6018.914507,2253.891703,0.214987,2538658.0,302163.1,149531.4,81545.42,408374.5
min,1.0,32993.0,2.0,0.0,0.0,0.0,13975.0,2451.0,718.0,266.0,545.0
25%,25.0,175999.2,4.0,0.1,0.2,0.0,127730.0,20643.0,9652.5,4262.25,12812.0
50%,50.0,428606.5,4.0,0.5,1.0,0.0,296434.0,50980.5,25078.0,10675.5,33686.0
75%,75.0,883797.0,4.0,1.7,3.3,0.0,619835.8,101814.0,52295.0,23228.75,80157.25
max,100.0,86273130.0,4.0,227105.7,69441.4,7.49,63546770.0,5404966.0,3158756.0,2122183.0,12495920.0


Before going further, let's summarize what we have got from the dataset.

- Our dataset has games from different categories, with different ratings and different numbers of installs.
- The 'installs' variable holds useful numerical information. It would be a good idea to adjust it so we can use it as a numerical variable.
- There are no missing values, which is very good for the data preparation stage.
- The 'category' column is a categorical variable. It would be good to see whether there are any notable differences among the game categories.
- The numerical variables deserve special attention in further analysis.
- 'paid' and 'price' seem to have a lot in common. We need to look at them in detail and, if necessary, drop one of them for simplicity.

- Let's make the necessary adjustments before moving to the analysis part.

In [7]:
df['installs'].value_counts()

10.0 M      805
50.0 M      252
5.0 M       245
100.0 M     204
1.0 M       192
500.0 k      15
500.0 M      12
100.0 k       3
1000.0 M      2
Name: installs, dtype: int64

- Let's make 'installs' a numerical variable by doing a small adjustment.

In [8]:
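def in_thousand(inst):
    # Express the two 'k' (thousand) labels in millions so every value shares one unit.
    if inst == '500.0 k':
        return '0.5 M'
    elif inst == '100.0 k':
        return '0.1 M'
    else:
        return inst

df['installs'] = df['installs'].apply(in_thousand)

# Remove the 'M' suffix and surrounding whitespace, then convert the text to float.
df['installs'] = df['installs'].str.replace('M', '').str.strip().astype('float')

# Rename the column so the unit is explicit.
df = df.rename(columns={'installs': 'installs_in_million'})
df['installs_in_million'].value_counts()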

10.0      805
50.0      252
5.0       245
100.0     204
1.0       192
0.5        15
500.0      12
0.1         3
1000.0      2
Name: installs_in_million, dtype: int64
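
- The function above handles only the two 'k' labels that occur in this dataset. As a slightly more general sketch, assuming every label has the form `<number> <suffix>`, the conversion could be driven by a suffix table. The helper name `installs_to_millions` and the `B` suffix are assumptions for illustration and are not part of the original notebook.

```python
# Hypothetical helper: convert install labels such as '500.0 k' or '10.0 M'
# into a float expressed in millions.
# Assumption: each label is "<number> <suffix>"; 'B' is included only as a guess,
# since it does not occur in this dataset.
SUFFIX_TO_MILLIONS = {"k": 0.001, "M": 1.0, "B": 1000.0}


def installs_to_millions(label: str) -> float:
    number, suffix = label.split()
    return float(number) * SUFFIX_TO_MILLIONS[suffix]


labels = ["500.0 k", "100.0 k", "10.0 M", "1000.0 M"]
converted = [installs_to_millions(label) for label in labels]
print(converted)
```

- With pandas, the same helper could be applied to the raw text column with `df['installs'].map(installs_to_millions)`.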

- Let's see price and paid columns and decide whether necessary to continue with both of them or drop one of them.

In [9]:
df['price'].value_counts()

0.00    1723
1.99       3
1.49       1
0.99       1
2.99       1
7.49       1
Name: price, dtype: int64

In [10]:
df['paid'].value_counts()

False    1723
True        7
Name: paid, dtype: int64

- OK, more than 99% of the games are free (1723 of 1730), so there are too few paid games to compare different price ranges.
- As a common rule of thumb, a sample of fewer than 30 observations is usually too small to represent its population.
- For this dataset, the 'price' column does not have much to offer for further analysis.
- So let's drop the 'price' column.
- Dropping columns and deleting rows are decisions to be taken very cautiously, and they should be based on analysis and domain knowledge.
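
- Before dropping one of two overlapping columns, it can help to confirm that they really carry the same information. The sketch below uses a toy frame (the rows are made up for illustration, not taken from the dataset) and checks whether `paid` is simply `price > 0`.

```python
import pandas as pd

# Toy data for illustration only; in the notebook the check would run on df.
toy = pd.DataFrame({
    "price": [0.0, 1.99, 0.0, 7.49],
    "paid": [False, True, False, True],
})

# True when every row agrees: paid games have a positive price, free games cost 0.
paid_matches_price = bool(((toy["price"] > 0) == toy["paid"]).all())
print(paid_matches_price)
```

- If the same check holds on the full data, the 'paid' column alone keeps the free/paid distinction.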

In [11]:
df.drop('price', axis=1, inplace=True)

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1730 entries, 0 to 1729
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   rank                 1730 non-null   int64  
 1   title                1730 non-null   object 
 2   total ratings        1730 non-null   int64  
 3   installs_in_million  1730 non-null   float64
 4   average rating       1730 non-null   int64  
 5   growth (30 days)     1730 non-null   float64
 6   growth (60 days)     1730 non-null   float64
 7   category             1730 non-null   object 
 8   5 star ratings       1730 non-null   int64  
 9   4 star ratings       1730 non-null   int64  
 10  3 star ratings       1730 non-null   int64  
 11  2 star ratings       1730 non-null   int64  
 12  1 star ratings       1730 non-null   int64  
 13  paid                 1730 non-null   bool   
dtypes: bool(1), float64(3), int64(8), object(2)
memory usage: 177.5+ KB


- Seems OK.  Let's move on to the next step: **analysis part**.

### Analysis Part

- Let's first look at the categories

### Game Categories

In [13]:
df['category'].value_counts(normalize=True)

GAME CARD            0.072832
GAME WORD            0.060116
GAME TRIVIA          0.057803
GAME ADVENTURE       0.057803
GAME SPORTS          0.057803
GAME EDUCATIONAL     0.057803
GAME RACING          0.057803
GAME ROLE PLAYING    0.057803
GAME CASUAL          0.057803
GAME BOARD           0.057803
GAME PUZZLE          0.057803
GAME CASINO          0.057803
GAME STRATEGY        0.057803
GAME MUSIC           0.057803
GAME ARCADE          0.057803
GAME SIMULATION      0.057803
GAME ACTION          0.057803
Name: category, dtype: float64

- The categories are almost the same size.

In [14]:
fig = px.histogram(df, x="category", title='Game Categories')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

### Total Ratings

In [15]:
df['total ratings'].describe()

count    1.730000e+03
mean     1.064332e+06
std      3.429250e+06
min      3.299300e+04
25%      1.759992e+05
50%      4.286065e+05
75%      8.837970e+05
max      8.627313e+07
Name: total ratings, dtype: float64

In [16]:

fig = px.histogram(df, x= 'total ratings', title='Total Ratings of the Games')


fig.show()

In [17]:
fig = px.box(df, x='total ratings', hover_data=['title', 'category'])
fig.update_traces(quartilemethod="inclusive")
fig.show()

- As we have seen in the histogram, quite a lot of the games fall in the 0 to 500,000 ratings range.
- On the other hand, we have quite a number of outliers, which increase the mean and pull it further away from the median.
- We have a highly skewed distribution, more specifically a right-skewed distribution with possible outliers on the maximum side. It is good to remember that in further analysis.
- In these kinds of situations, it is a good idea to use a median-based approach.
- The median, rather than the mean, should be used to get insights from such distributions.
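
- To make the mean versus median point concrete, here is a minimal sketch on a made-up right-skewed sample (the numbers are not from the dataset). It also applies the common 1.5 x IQR rule for flagging possible outliers. Note that pandas' default quantile interpolation may differ slightly from the `quartilemethod` used in the box plot.

```python
import pandas as pd

# Toy right-skewed sample (made-up numbers): most values are small, one is huge.
ratings = pd.Series([100, 150, 200, 250, 300, 10_000])

mean_value = ratings.mean()
median_value = ratings.median()

# 1.5 * IQR rule: points above Q3 + 1.5 * IQR are flagged as possible outliers.
q1, q3 = ratings.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
possible_outliers = ratings[ratings > upper_fence]

print(mean_value, median_value, upper_fence)
print(possible_outliers.tolist())
```

- A single extreme value is enough to pull the mean far above the median, while the median barely moves.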

### Number of Game Installs

In [18]:
df['installs_in_million'].describe()

count    1730.000000
mean       29.176185
std        60.287333
min         0.100000
25%         5.000000
50%        10.000000
75%        50.000000
max      1000.000000
Name: installs_in_million, dtype: float64

In [19]:
fig = px.histogram(df, x= 'installs_in_million', title='Number of Game Install in Millions')

fig.show()

In [20]:
fig = px.box(df, x='installs_in_million', hover_data=['title', 'category'])
fig.update_traces(quartilemethod="inclusive")
fig.show()

- We have a right-skewed distribution with possible outliers.
- The box plot shows Candy Crush Saga with 1 billion installs and Clash of Clans with 500 million installs.
- It is always a good idea to check against the dataset: it contains 2 games with 1 billion installs and 12 games with 500 million installs, and the box plot shows only one example for each of these values.
- The size of the outliers definitely affects the mean and the distribution.
- The difference between the mean and the median is really large (mean = 29.2M, median = 10M).
- As mentioned above, it is a good idea to use a median-based approach.
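
- Since the box plot hover shows only one title per overlapping point, a small lookup can list every title that shares a value. This sketch uses a few rows copied from the Top 20 Games table later in this notebook; in practice the same filter would run on `df`.

```python
import pandas as pd

# A few rows copied from the Top 20 Games table in this notebook.
sample = pd.DataFrame({
    "title": ["Subway Surfers", "Candy Crush Saga", "Clash of Clans", "Temple Run"],
    "installs_in_million": [1000.0, 1000.0, 500.0, 500.0],
})

# All titles that share the largest install value, not just the one shown on hover.
max_installs = sample["installs_in_million"].max()
top_titles = sample.loc[sample["installs_in_million"] == max_installs, "title"].tolist()
print(top_titles)
```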

### Paid-Free Games

In [21]:
df['paid'].value_counts(normalize=True)

False    0.995954
True     0.004046
Name: paid, dtype: float64

In [22]:
paid_free= df['paid'].value_counts()
label =['Free','Paid']
fig = px.pie(paid_free, values=df['paid'].value_counts().values, names=label,
             title='Paid & Free Games')
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()

- Almost all of the games in this dataset (all except 7 out of 1730) are free.
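
- The pie chart above relies on `value_counts()` returning `False` first, so that the hard-coded labels line up with the counts. As an alternative sketch (on a toy column, for illustration only), the labels can be derived from the data before counting.

```python
import pandas as pd

# Toy boolean column for illustration only.
paid = pd.Series([False, False, True, False])

# Map each boolean to its label first, so a label can never be attached to the wrong count.
paid_counts = paid.map({False: "Free", True: "Paid"}).value_counts()
print(paid_counts)
```

- The resulting index and values could then be passed to `px.pie` as `names` and `values`.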

- OK, from this point on we can look deeper into the dataset.

### Total Ratings by Category

In [23]:
total_ratings_by_category = df.groupby('category')['total ratings'].mean()
total_ratings_by_category

category
GAME ACTION          4.011344e+06
GAME ADVENTURE       8.935617e+05
GAME ARCADE          1.793780e+06
GAME BOARD           4.457431e+05
GAME CARD            3.326041e+05
GAME CASINO          3.619031e+05
GAME CASUAL          2.470866e+06
GAME EDUCATIONAL     1.529804e+05
GAME MUSIC           2.163020e+05
GAME PUZZLE          9.466929e+05
GAME RACING          1.139027e+06
GAME ROLE PLAYING    7.087648e+05
GAME SIMULATION      9.341417e+05
GAME SPORTS          1.353829e+06
GAME STRATEGY        1.856570e+06
GAME TRIVIA          2.982217e+05
GAME WORD            3.943603e+05
Name: total ratings, dtype: float64

In [24]:
fig = px.bar(total_ratings_by_category, x= total_ratings_by_category.index, y=total_ratings_by_category.values, labels={'y':'Total Ratings'})
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- Games in the action, casual, strategy, arcade, and sports categories get considerably more ratings on average than games in the educational and music categories.
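
- Because total ratings are right-skewed, it may be worth looking at the median next to the mean for each category. A minimal sketch on made-up numbers (not the dataset) shows how one very large value moves the mean of a category but not its median.

```python
import pandas as pd

# Toy data (made-up numbers) with one very large value in category "A".
toy = pd.DataFrame({
    "category": ["A", "A", "A", "B", "B", "B"],
    "total ratings": [100, 200, 9_000, 300, 400, 500],
})

# Compare the mean and the median side by side for each category.
summary = toy.groupby("category")["total ratings"].agg(["mean", "median"])
print(summary)
```

- On the real data, the same call with `df` in place of `toy` would show which categories owe their high mean to a few very popular games.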

### Number of Game Installations by Game Category

In [25]:
install_by_category = df.groupby('category')['installs_in_million'].mean()
install_by_category

category
GAME ACTION          74.100000
GAME ADVENTURE       18.030000
GAME ARCADE          71.610000
GAME BOARD           21.230000
GAME CARD            12.484127
GAME CASINO           7.715000
GAME CASUAL          63.970000
GAME EDUCATIONAL     17.895000
GAME MUSIC           12.487000
GAME PUZZLE          36.210000
GAME RACING          46.750000
GAME ROLE PLAYING    14.080000
GAME SIMULATION      27.710000
GAME SPORTS          33.610000
GAME STRATEGY        23.910000
GAME TRIVIA           6.901000
GAME WORD            12.317308
Name: installs_in_million, dtype: float64

In [26]:
fig = px.bar(install_by_category, x= install_by_category.index, y=install_by_category.values, labels={'y':'Install in Millions'})
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- On average, games in the action, arcade, and casual categories are installed far more often than games in the trivia, casino, and word categories.

In [27]:
growth_by_category_30 = df.groupby('category')['growth (30 days)'].mean()
growth_by_category_30

category
GAME ACTION            18.808000
GAME ADVENTURE        259.101000
GAME ARCADE            58.924000
GAME BOARD             34.445000
GAME CARD             746.598413
GAME CASINO          2335.253000
GAME CASUAL            36.020000
GAME EDUCATIONAL      102.455000
GAME MUSIC             24.626000
GAME PUZZLE            44.362000
GAME RACING           207.103000
GAME ROLE PLAYING     209.979000
GAME SIMULATION        13.406000
GAME SPORTS           159.543000
GAME STRATEGY          18.281000
GAME TRIVIA          1079.680000
GAME WORD              22.433654
Name: growth (30 days), dtype: float64

In [28]:
fig = px.bar(growth_by_category_30, x= growth_by_category_30.index, y=growth_by_category_30, labels={'y':'Growth in 30 days'})
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- Even though games in the action category get more ratings and installs than games in the other categories, games in the casino category show the highest average growth over 30 days.

- Let's see whether the same is true for the 60-day growth.

In [29]:
growth_by_category_60 = df.groupby('category')['growth (60 days)'].mean()
growth_by_category_60

category
GAME ACTION          118.294000
GAME ADVENTURE         6.084000
GAME ARCADE           21.970000
GAME BOARD           587.891000
GAME CARD            555.337302
GAME CASINO            2.193000
GAME CASUAL           14.812000
GAME EDUCATIONAL      14.748000
GAME MUSIC            22.160000
GAME PUZZLE           12.062000
GAME RACING           88.963000
GAME ROLE PLAYING      3.037000
GAME SIMULATION       20.196000
GAME SPORTS            8.492000
GAME STRATEGY        435.440000
GAME TRIVIA            6.180000
GAME WORD             55.725000
Name: growth (60 days), dtype: float64

In [30]:
fig = px.bar(growth_by_category_60, x= growth_by_category_60.index, y=growth_by_category_60, labels={'y':'Growth in 60 days'})
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- No. The 60-day growth of games in the casino, adventure, and role playing categories is far lower than their 30-day growth.
- With the given dataset we can only speculate; we cannot draw analytical conclusions from it. We would need more variables to explain the large differences between 30-day and 60-day growth in some categories.
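
- One thing we can check with the data at hand is whether a category mean is driven by a single game. The `describe()` output earlier shows a maximum 30-day growth of 227105.7 against a 75th percentile of 1.7, so a few extreme values are likely. The sketch below, on made-up numbers, finds the game with the largest growth in each category.

```python
import pandas as pd

# Toy data (made-up numbers): one game with extreme growth dominates category "X".
toy = pd.DataFrame({
    "title": ["g1", "g2", "g3", "g4"],
    "category": ["X", "X", "Y", "Y"],
    "growth (30 days)": [0.5, 2000.0, 1.0, 2.0],
})

# Row label of the largest growth value within each category.
top_rows = toy.groupby("category")["growth (30 days)"].idxmax()
top_growth = toy.loc[top_rows, ["category", "title", "growth (30 days)"]]
print(top_growth)
```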

- Let's look at the top 3 ranked games in each category in detail.

### Top 3 Ranked Games by Category

In [31]:
top_ranked_games = df[df['rank']<4][['rank','title','category', 'total ratings', 'installs_in_million', '5 star ratings']]
top_ranked_games

Unnamed: 0,rank,title,category,total ratings,installs_in_million,5 star ratings
0,1,Garena Free Fire- World Series,GAME ACTION,86273129,500.0,63546766
1,2,PUBG MOBILE - Traverse,GAME ACTION,37276732,500.0,28339753
2,3,Mobile Legends: Bang Bang,GAME ACTION,26663595,100.0,18777988
100,1,Roblox,GAME ADVENTURE,21820451,100.0,16674013
101,2,Pokémon GO,GAME ADVENTURE,14541662,100.0,9517488
102,3,Criminal Case,GAME ADVENTURE,4273420,100.0,3264905
200,1,Subway Surfers,GAME ARCADE,35665901,1000.0,27138572
201,2,Hungry Shark Evolution - Offline survival game,GAME ARCADE,7202013,100.0,5220860
202,3,Geometry Dash Lite,GAME ARCADE,6960814,100.0,4787054
300,1,Ludo King™,GAME BOARD,7512316,500.0,5291589


### Top 3 Games by Category and Their Total Ratings

In [32]:
fig = px.scatter(top_ranked_games, y='title', x='total ratings',
                 hover_data=['category', 'rank'], color='category',
                 title="Top 3 Games by Their Total Ratings")
fig.show()

- As mentioned above, games in the action, casual, strategy, arcade, and sports categories get considerably more ratings than games in the educational and music categories.
- The same is true even for the top ranked games in these categories.

### Top 3 Games by Category and Their Installs in Millions

In [33]:
fig = px.scatter(top_ranked_games, y='title', x='installs_in_million',
                 hover_data=['category', 'rank'], color='category',
                 title="Top 3 Games by Their Installations in Millions")
fig.show()

- As mentioned above, games in the action, arcade, and casual categories are installed far more often than games in the trivia, casino, and word categories.

- The same is true even for the top ranked games in these categories.

### Top 3 Games by Category and Their 5 star ratings

In [34]:
fig = px.scatter(top_ranked_games, y='title', x='5 star ratings',
                 hover_data=['category', 'rank'], color='category',
                 title="Top 3 Games by 5 Star Ratings")
fig.show()

- Games in the action, casual, strategy, and arcade categories also get more 5 star ratings than games in the educational and music categories.

- Finally, let's see the top 20 games.

### Top 20 Games

In [35]:
top_20 = df.sort_values(by='installs_in_million', ascending=False).head(20)
top_20

Unnamed: 0,rank,title,total ratings,installs_in_million,average rating,growth (30 days),growth (60 days),category,5 star ratings,4 star ratings,3 star ratings,2 star ratings,1 star ratings,paid
200,1,Subway Surfers,35665901,1000.0,4,0.5,1.0,GAME ARCADE,27138572,3366600,1622695,814890,2723142,False
626,1,Candy Crush Saga,31367945,1000.0,4,0.9,1.6,GAME CASUAL,23837448,4176798,1534041,486005,1333650,False
0,1,Garena Free Fire- World Series,86273129,500.0,4,2.1,6.9,GAME ACTION,63546766,4949507,3158756,2122183,12495915,False
207,8,Temple Run,4816448,500.0,4,0.7,1.5,GAME ARCADE,3184391,438320,318164,204384,671187,False
1426,1,Clash of Clans,55766763,500.0,4,0.3,1.0,GAME STRATEGY,43346128,5404966,2276203,971321,3768141,False
1026,1,Hill Climb Racing,10188038,500.0,4,0.4,0.8,GAME RACING,7148370,982941,607603,338715,1110407,False
1326,1,8 Ball Pool,21632735,500.0,4,1.2,630.8,GAME SPORTS,16281475,2268294,1017204,425693,1640067,False
630,5,Pou,11506051,500.0,4,0.2,0.5,GAME CASUAL,8175679,1051014,688712,346244,1244400,False
628,3,My Talking Angela,13050503,500.0,4,0.6,1.4,GAME CASUAL,9165205,1073761,636763,399662,1775110,False
1,2,PUBG MOBILE - Traverse,37276732,500.0,4,1.8,3.6,GAME ACTION,28339753,2164478,1253185,809821,4709492,False


In [36]:
fig = px.bar(top_20, x='title', y='installs_in_million', hover_data=['5 star ratings'], color='category')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- The top 2 games have 1 billion installs each.
- The following 12 games have 500 million installs each.

In [37]:
fig = px.bar(top_20, x='title', y='total ratings', hover_data=['5 star ratings'], color='category')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- It is important to see that, even though Candy Crush Saga and Subway Surfers have 1 billion installs, this does not automatically mean that they have the most ratings.
- Garena Free Fire- World Series has 500 million installs but more than 86 million total ratings.
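
- One way to put this observation into numbers is to divide total ratings by installs. The sketch below uses three rows copied from the top 20 table above; the column name `ratings_per_million_installs` is introduced here for illustration only.

```python
import pandas as pd

# Rows copied from the top 20 table above.
sample = pd.DataFrame({
    "title": ["Subway Surfers", "Candy Crush Saga", "Garena Free Fire- World Series"],
    "total ratings": [35665901, 31367945, 86273129],
    "installs_in_million": [1000.0, 1000.0, 500.0],
})

# Ratings received per million installs (hypothetical helper column).
sample["ratings_per_million_installs"] = (
    sample["total ratings"] / sample["installs_in_million"]
)
ranked = sample.sort_values("ratings_per_million_installs", ascending=False)
print(ranked[["title", "ratings_per_million_installs"]])
```

- Keep in mind that installs are reported in coarse buckets (for example 500 M or 1000 M), so such a ratio is only a rough indicator.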

## This notebook is a part of the 9 Beginner Friendly EDAs
## If you like this one, you can also check out other notebooks in the Beginner Friendly EDAs series!

* [Data Analyst Jobs - EDA](https://www.kaggle.com/kaanboke/plotly-data-analyst-jobs)
* [Hollywood Top Movies- EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-top-movies)
* [UDEMY Courses EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-udemy)
* [World Happiness Report - EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-eda)
* [Countries Life Expectancy](https://www.kaggle.com/kaanboke/plotly-beginner-friendly)
* [Netflix Movies- EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-netflix)
* [Amazon Top 50 Bestselling Books EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-amazon)
* [London bike Sharing EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-london-bike)


- Thanks to the dataset contributor for this data. I really enjoyed working on it.

- It was quite a pleasure to share this detailed, beginner friendly EDA with you. Thanks for your time.

- All the best