## Analysing PS4 Sales Data

##### Problem Statement
A teenager wanted to show his mom how many people around the world use the PS4 console, so that she would buy one for him. He reached out to Ama to analyze PS4 sales data so that his mom would have a fair idea of:
1. How many people use the PS4 console in the world?
2. Which game on the PS4 console has the highest number of sales?
3. Which game on the PS4 console has the lowest number of sales?

In [1]:
# Importing the necessary libraries 
import pandas as pd
import plotly.express as px
from plotly.offline import iplot
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import chardet

#### Reading The data
Reading the data shows which rows and columns are available.
A clear view of the data makes it easier to decide what changes to make during data cleaning.

In [2]:
# Decode the raw bytes as UTF-8, replacing any invalid bytes, and save a clean copy
with open('PS4_GamesSales.csv', 'rb') as file:
    content = file.read().decode('utf-8', errors='replace')
    with open('cleaned_file.csv', 'w', encoding='utf-8') as clean_file:
        clean_file.write(content)

# Then read the cleaned file
df = pd.read_csv('cleaned_file.csv')
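
The intermediate `cleaned_file.csv` is not strictly required. As a minimal sketch, assuming the file is mostly UTF-8 with a few stray bytes, the decoded text can be handed to pandas in memory through `io.StringIO`. The helper name `read_csv_lenient` and the sample bytes are made up for illustration and are not part of the original notebook.

```python
import io

import pandas as pd


def read_csv_lenient(raw_bytes: bytes) -> pd.DataFrame:
    """Decode bytes as UTF-8, replacing invalid bytes, then parse as CSV."""
    text = raw_bytes.decode("utf-8", errors="replace")
    return pd.read_csv(io.StringIO(text))


# Hypothetical sample: \xe9 is not valid UTF-8 here, so it is replaced with U+FFFD
sample = b"Game,Global\nPok\xe9mon,1.5\nFIFA 18,11.8\n"
demo = read_csv_lenient(sample)
print(demo)
```

With the real file, the bytes would come from `open('PS4_GamesSales.csv', 'rb').read()`.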

In [None]:
df.head()

In [None]:
# Listing the column names
df.columns

In [3]:
df.sample(5)

Unnamed: 0,Game,Year,Genre,Publisher,North America,Europe,Japan,Rest of World,Global
113,EA Sports UFC 2,2016.0,Sports,EA Sports,0.44,0.65,0.0,0.21,1.31
918,The Legend of Korra (2014),2014.0,Action,Activision,0.0,0.0,0.0,0.0,0.0
32,Destiny 2,2017.0,Shooter,Activision,1.92,1.44,0.1,0.69,4.14
804,Mercenary Kings,,Misc,,0.0,0.0,0.0,0.0,0.0
233,Skylanders: SuperChargers,2015.0,Action-Adventure,Activision,0.29,0.09,0.0,0.08,0.45


In [None]:
# Display the number of rows and columns
df.shape

In [None]:
#Reading the information about the dataset
df.info()

In [None]:
# Using missingno to visualise the missing values in the dataset
import missingno as msno
msno.bar(df)

In [4]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Year,825.0,2015.966061,1.29836,2013.0,2015.0,2016.0,2017.0,2020.0
North America,1034.0,0.204613,0.563471,0.0,0.0,0.02,0.12,6.18
Europe,1034.0,0.248714,0.785491,0.0,0.0,0.0,0.13,9.71
Japan,1034.0,0.033636,0.108344,0.0,0.0,0.0,0.03,2.17
Rest of World,1034.0,0.089014,0.24941,0.0,0.0,0.01,0.05,3.02
Global,1034.0,0.576054,1.583534,0.0,0.0,0.06,0.3575,19.39


#### Data Cleaning 
I will start by changing some of the column names, dropping null values, dropping zeros, and changing the data type of some of the columns.

### Changing column names to suit the goal stated earlier

In [5]:
# renaming Year to Year_Published and Global to Total Sales
df.rename(columns={"Year":"Year_Published", "Global":"Total Sales"}, inplace=True)
df.head()

Unnamed: 0,Game,Year_Published,Genre,Publisher,North America,Europe,Japan,Rest of World,Total Sales
0,Grand Theft Auto V,2014.0,Action,Rockstar Games,6.06,9.71,0.6,3.02,19.39
1,Call of Duty: Black Ops 3,2015.0,Shooter,Activision,6.18,6.05,0.41,2.44,15.09
2,Red Dead Redemption 2,2018.0,Action-Adventure,Rockstar Games,5.26,6.21,0.21,2.26,13.94
3,Call of Duty: WWII,2017.0,Shooter,Activision,4.67,6.21,0.4,2.12,13.4
4,FIFA 18,2017.0,Sports,EA Sports,1.27,8.64,0.15,1.73,11.8


In [None]:
# confirming that the column names have been changed successfully
df.info()

### Checking data types and null values

In [6]:
# counting nulls in each column of the dataset
df.isnull().sum()

Game                0
Year_Published    209
Genre               0
Publisher         209
North America       0
Europe              0
Japan               0
Rest of World       0
Total Sales         0
dtype: int64
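
Both `Year_Published` and `Publisher` report 209 nulls, which hints that the same rows may be affected in both columns. The sketch below, run on a small made-up frame rather than the real data, shows one way to express nulls as a percentage and to check whether two columns are missing on exactly the same rows. The names `toy`, `null_summary`, and `same_rows` are illustrative.

```python
import pandas as pd

# Toy frame standing in for the PS4 data (values are illustrative only)
toy = pd.DataFrame({
    "Game": ["A", "B", "C", "D"],
    "Year_Published": [2014.0, None, 2016.0, None],
    "Publisher": ["X", None, "Y", None],
})

# Null count and null percentage per column
null_summary = pd.DataFrame({
    "null_count": toy.isnull().sum(),
    "null_pct": (toy.isnull().mean() * 100).round(1),
})
print(null_summary)

# True when every row missing Year_Published is also missing Publisher, and vice versa
same_rows = (toy["Year_Published"].isnull() == toy["Publisher"].isnull()).all()
print("Nulls fall on the same rows:", same_rows)
```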

In [7]:
# dropping rows with null values in Year_Published or Publisher
df = df.dropna(subset=['Year_Published','Publisher'])

In [8]:
# checking if the nulls were successfully dropped 
df.isna().sum()

Game              0
Year_Published    0
Genre             0
Publisher         0
North America     0
Europe            0
Japan             0
Rest of World     0
Total Sales       0
dtype: int64
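
The cleaning plan mentions changing the data type of some columns. `Year_Published` is shown as a float (for example `2014.0`), and once the nulls are gone it can be stored as an integer. This is a minimal sketch on made-up rows, assuming every remaining year is a whole number.

```python
import pandas as pd

# Toy frame standing in for the PS4 data (values are illustrative only)
toy = pd.DataFrame({
    "Game": ["A", "B", "C"],
    "Year_Published": [2014.0, 2016.0, None],
})

# Drop rows without a year first: NaN cannot be cast to a plain integer
toy = toy.dropna(subset=["Year_Published"]).astype({"Year_Published": int})
print(toy.dtypes)
print(toy)
```

Casting before dropping the nulls would raise an error, which is why the order of the two steps matters.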

In [9]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Year_Published,825.0,2015.966061,1.29836,2013.0,2015.0,2016.0,2017.0,2020.0
North America,825.0,0.256448,0.620259,0.0,0.0,0.05,0.19,6.18
Europe,825.0,0.3116,0.868271,0.0,0.0,0.02,0.22,9.71
Japan,825.0,0.042048,0.119814,0.0,0.0,0.0,0.04,2.17
Rest of World,825.0,0.111552,0.274713,0.0,0.0,0.02,0.09,3.02
Total Sales,825.0,0.721721,1.743122,0.0,0.03,0.12,0.56,19.39


#### Counting zero values in the dataset

In [None]:
# count rows per Total Sales value using group by (the 0.0 row shows the zero count)
df.groupby("Total Sales").count()

Since only a small number of rows have zero sales, we are going to drop them.

In [10]:
# dropping the rows with zero values in the Total Sales column
df = df[df['Total Sales'] != 0]
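
A boolean mask is a more direct way to see how many rows have zero sales and what share of the data they make up before dropping them. The sketch below uses a small made-up frame; only the column name `Total Sales` comes from the notebook.

```python
import pandas as pd

# Toy frame standing in for the PS4 data (values are illustrative only)
toy = pd.DataFrame({
    "Game": ["A", "B", "C", "D", "E"],
    "Total Sales": [19.39, 0.0, 0.12, 0.0, 0.03],
})

# True for rows whose total sales are exactly zero
zero_mask = toy["Total Sales"] == 0
print("Zero-sales rows:", zero_mask.sum())
print("Share of rows:", f"{zero_mask.mean():.1%}")

# Keep only the rows with non-zero sales
filtered = toy[~zero_mask]
print("Rows before/after:", len(toy), len(filtered))
```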

In [None]:
df

## EXPLORATORY DATA ANALYSIS

Q1. 
How many people use PS4 in the world?

In [None]:
# Printing the total global sales across all games
sales = df['Total Sales'].sum()
print("Total Sales: ", sales)
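
Note that the sum of `Total Sales` counts game copies sold, not distinct console owners, so it answers Q1 only as a rough proxy. A regional breakdown can add context. The sketch below reuses the five rows printed by `df.head()` earlier; the names `top5`, `region_cols`, `regional_totals`, and `regional_share` are illustrative.

```python
import pandas as pd

# The five rows shown by df.head() after the rename
top5 = pd.DataFrame({
    "Game": ["Grand Theft Auto V", "Call of Duty: Black Ops 3",
             "Red Dead Redemption 2", "Call of Duty: WWII", "FIFA 18"],
    "North America": [6.06, 6.18, 5.26, 4.67, 1.27],
    "Europe": [9.71, 6.05, 6.21, 6.21, 8.64],
    "Japan": [0.60, 0.41, 0.21, 0.40, 0.15],
    "Rest of World": [3.02, 2.44, 2.26, 2.12, 1.73],
    "Total Sales": [19.39, 15.09, 13.94, 13.40, 11.80],
})

region_cols = ["North America", "Europe", "Japan", "Rest of World"]

# Sales per region, largest first, and each region's share of the regional total
regional_totals = top5[region_cols].sum().sort_values(ascending=False)
regional_share = (regional_totals / regional_totals.sum() * 100).round(1)
print(regional_totals)
print(regional_share)
```

On the full dataset the same two lines would run on `df[region_cols]`.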

Q2. 
Which Game on PS4 has the highest number of sales?

In [None]:
# Game with the most sales (named top_game so the built-in max() is not shadowed)
top_game = df.loc[df['Total Sales'].idxmax()]
print("Game With The Highest Number of Sales: ", top_game)

Q3.
Which Game on PS4 has the lowest number of sales?

In [None]:
# Game with the fewest sales (named lowest_game so the built-in min() is not shadowed)
lowest_game = df.loc[df['Total Sales'].idxmin()]
print("Game With The Lowest Number of Sales: ", lowest_game)
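
`idxmax()` and `idxmin()` return only the first matching row, so any games tied at the same value are hidden. As a sketch on made-up data, `nlargest` can list the top sellers, and a filter on the minimum value can list every game tied for the lowest sales.

```python
import pandas as pd

# Toy frame standing in for the PS4 data (values are illustrative only)
toy = pd.DataFrame({
    "Game": ["A", "B", "C", "D", "E"],
    "Total Sales": [19.39, 15.09, 0.01, 0.45, 0.01],
})

# Top three games by total sales
top3 = toy.nlargest(3, "Total Sales")
print(top3)

# Every game that shares the lowest sales value, not just the first one
lowest_value = toy["Total Sales"].min()
lowest_games = toy[toy["Total Sales"] == lowest_value]
print(lowest_games)
```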

In [None]:
# Bar graph showing the number of games published each year
# df['Year_Published'].value_counts().plot(kind='bar')
# plt.show()
year_counts = df['Year_Published'].value_counts()
plt.bar(year_counts.index, year_counts.values)
plt.show()
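
`value_counts()` orders its result by frequency, not by year. For a chronological view, the counts can be sorted by their index first. The sketch below uses a short made-up series; the names `years`, `year_counts`, and `year_pct` are illustrative.

```python
import pandas as pd

# Toy series standing in for df['Year_Published'] (values are illustrative only)
years = pd.Series([2016, 2014, 2016, 2015, 2016, 2015], name="Year_Published")

# Count games per year, then order the result chronologically
year_counts = years.value_counts().sort_index()

# Percentage of games per year, the quantity the pie charts display
year_pct = (year_counts / year_counts.sum() * 100).round(1)
print(year_counts)
print(year_pct)

# The sorted counts can then be passed to plt.bar(year_counts.index, year_counts.values)
```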

In [None]:
# Pie chart of the number of games released each year
# df['Year_Published'].value_counts().plot(kind='pie')
# plt.show()
data = df['Year_Published'].value_counts()
plt.pie(data.values, labels=data.index)
plt.show()

In [16]:
fig = px.pie(df,
            names = "Year_Published",
            template = "plotly_dark",
            color_discrete_sequence = px.colors.sequential.RdBu,
            color = "Year_Published",
            hole = 0.4,
            title = "Percentage of Games Published Each Year")
iplot(fig)