---

#  Top Games on Google Playstore - EDA

---

- In this study, we perform an Exploratory Data Analysis (EDA) of the Top Games on Google Playstore dataset, with brief but clear explanations.

---

- We start by importing the libraries used throughout the study, and then begin exploring the dataset.

- We use both Seaborn and Plotly to have a variety of visualization options.

In [9]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import plotly 
import plotly.express as px
import plotly.graph_objs as go
import plotly.offline as py
from plotly.offline import iplot
from plotly.subplots import make_subplots
import plotly.figure_factory as ff

import warnings
warnings.filterwarnings("ignore")

## Overview Stage 

- Read the CSV file
- Use the necessary functions to get basic information about the dataset

In [10]:
df = pd.read_csv('android-games.csv')

In [11]:
df.head()

Unnamed: 0,rank,title,total ratings,installs,average rating,growth (30 days),growth (60 days),price,category,5 star ratings,4 star ratings,3 star ratings,2 star ratings,1 star ratings,paid
0,1,Garena Free Fire- World Series,86273129,500.0 M,4,2.1,6.9,0.0,GAME ACTION,63546766,4949507,3158756,2122183,12495915,False
1,2,PUBG MOBILE - Traverse,37276732,500.0 M,4,1.8,3.6,0.0,GAME ACTION,28339753,2164478,1253185,809821,4709492,False
2,3,Mobile Legends: Bang Bang,26663595,100.0 M,4,1.5,3.2,0.0,GAME ACTION,18777988,1812094,1050600,713912,4308998,False
3,4,Brawl Stars,17971552,100.0 M,4,1.4,4.4,0.0,GAME ACTION,13018610,1552950,774012,406184,2219794,False
4,5,Sniper 3D: Fun Free Online FPS Shooting Game,14464235,500.0 M,4,0.8,1.5,0.0,GAME ACTION,9827328,2124154,1047741,380670,1084340,False


- In short, the dataset covers the top games on Google Playstore, including each game's title, average rating, number of installs, rating counts, and price.

In [13]:
df.shape

(1730, 15)

- The dataset has 1730 rows and 15 columns.

- Whether a dataset contains null values, and how many, has a crucial effect on the analysis.
- To be aware of any missing values, I would like to check for nulls in the dataset.

In [14]:
df.isnull().sum()

rank                0
title               0
total ratings       0
installs            0
average rating      0
growth (30 days)    0
growth (60 days)    0
price               0
category            0
5 star ratings      0
4 star ratings      0
3 star ratings      0
2 star ratings      0
1 star ratings      0
paid                0
dtype: int64

- Having no null values in the dataset makes me very happy, but it is a very rare situation in the real world.
- Since it is kind of a dream dataset, let's enjoy it together :)

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1730 entries, 0 to 1729
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   rank              1730 non-null   int64  
 1   title             1730 non-null   object 
 2   total ratings     1730 non-null   int64  
 3   installs          1730 non-null   object 
 4   average rating    1730 non-null   int64  
 5   growth (30 days)  1730 non-null   float64
 6   growth (60 days)  1730 non-null   float64
 7   price             1730 non-null   float64
 8   category          1730 non-null   object 
 9   5 star ratings    1730 non-null   int64  
 10  4 star ratings    1730 non-null   int64  
 11  3 star ratings    1730 non-null   int64  
 12  2 star ratings    1730 non-null   int64  
 13  1 star ratings    1730 non-null   int64  
 14  paid              1730 non-null   bool   
dtypes: bool(1), float64(3), int64(8), object(3)
memory usage: 191.0+ KB


- According to the output of the info function, most columns are integers or floats, so I can say that we have an easy-to-analyze dataset.
- Another point that immediately catches my attention is that, even though the 'installs' column holds the number of installs, it has the object dtype.
- To avoid potential problems, we had better convert it to integer or float.
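
One possible way to do this conversion is sketched below. It is only a sketch under assumptions: `parse_installs` is a hypothetical helper, the rows shown by `df.head()` only contain the "M" suffix, and the "K" and "B" multipliers are guesses about what the rest of the column might contain. A small stand-in frame is used so the snippet runs on its own.

```python
import pandas as pd

# Assumed suffix multipliers. Only "M" is visible in the rows printed above;
# "K" and "B" are assumptions about the rest of the column.
SUFFIXES = {"K": 1e3, "M": 1e6, "B": 1e9}

def parse_installs(value) -> float:
    """Turn a string such as '500.0 M' into a number (hypothetical helper)."""
    text = str(value).strip().upper()
    if text and text[-1] in SUFFIXES:
        # Strip the suffix, then scale the numeric part.
        return float(text[:-1].strip()) * SUFFIXES[text[-1]]
    # No known suffix: treat the value as a plain number.
    return float(text)

# Stand-in data so the sketch is self-contained.
sample = pd.DataFrame({"installs": ["500.0 M", "100.0 M", "50.0 k"]})
sample["installs_num"] = sample["installs"].map(parse_installs)
```

On the real data the same helper could be applied with `df["installs"].map(parse_installs)`; an unexpected format would raise a `ValueError`, which is a convenient way to discover suffixes not covered here.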

In [16]:
df.describe()

Unnamed: 0,rank,total ratings,average rating,growth (30 days),growth (60 days),price,5 star ratings,4 star ratings,3 star ratings,2 star ratings,1 star ratings
count,1730.0,1730.0,1730.0,1730.0,1730.0,1730.0,1730.0,1730.0,1730.0,1730.0,1730.0
mean,50.386705,1064332.0,3.908092,321.735896,122.554971,0.010942,762231.5,116436.6,57063.07,27103.36,101495.0
std,28.936742,3429250.0,0.290973,6018.914507,2253.891703,0.214987,2538658.0,302163.1,149531.4,81545.42,408374.5
min,1.0,32993.0,2.0,0.0,0.0,0.0,13975.0,2451.0,718.0,266.0,545.0
25%,25.0,175999.2,4.0,0.1,0.2,0.0,127730.0,20643.0,9652.5,4262.25,12812.0
50%,50.0,428606.5,4.0,0.5,1.0,0.0,296434.0,50980.5,25078.0,10675.5,33686.0
75%,75.0,883797.0,4.0,1.7,3.3,0.0,619835.8,101814.0,52295.0,23228.75,80157.25
max,100.0,86273130.0,4.0,227105.7,69441.4,7.49,63546770.0,5404966.0,3158756.0,2122183.0,12495920.0


To summarize what we have so far:
- We have a dataset with 1730 rows and 15 columns, containing detailed information about the top games on Google Playstore.
- Since we have no null values and mostly numeric columns, we are not going to need too many adjustments.
- Even though it looks quite all right, converting the installs column to a numeric type will make the analysis easier.
- Another point to take care of is that the price and paid columns have a lot in common. Working with one of them is most likely going to be enough, which means we should drop the other.
- For the further steps, we should keep in mind the uneven distribution of the price column and the possible outliers in the growth columns, whose maximum values lie far above their 75th percentiles.
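
The last two points can be checked before acting on them. The sketch below runs on stand-in rows rather than on `df`. It assumes that "paid" is redundant exactly when it equals `price > 0`, and it uses the common 1.5 × IQR rule to flag outliers, which is one convention among several. The growth column is used for the outlier check because `describe()` shows a very large gap between its 75th percentile and its maximum, but the same rule works on any numeric column.

```python
import pandas as pd

# Stand-in rows; in the notebook these columns would come from df.
sample = pd.DataFrame({
    "price": [0.0, 0.0, 7.49, 0.0, 1.99],
    "paid": [False, False, True, False, True],
    "growth (30 days)": [2.1, 1.8, 0.5, 227105.7, 1.4],
})

# 'paid' carries no extra information if it is True exactly when price > 0.
paid_matches_price = (sample["paid"] == (sample["price"] > 0)).all()
reduced = sample.drop(columns=["paid"]) if paid_matches_price else sample

# Flag outliers with the 1.5 * IQR rule (an assumed, conventional threshold).
col = sample["growth (30 days)"]
q1, q3 = col.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = col[(col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)]
```

If `paid_matches_price` turns out to be False on the real data, the mismatching rows are worth inspecting before either column is dropped.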