## Project Name: PS4 Games Sales Data Analysis 

### 1. Importing Libraries:

NumPy is required whenever numerical calculations are needed (means, medians, square roots, etc.). Pandas is a great module for data processing and data frames.

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from termcolor import colored

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

### 2. Loading Dataset & Exploratory Data Analysis:

We use the pandas "read_csv" function to read the file, which is in .csv format.

In [3]:
data = pd.read_csv('PS4_GamesSales.csv', header=0,encoding='unicode_escape')
pd.set_option('display.max_columns',None)
pd.set_option("display.max_rows",None)

print('The data has the shape of', data.shape[0], 'Rows and', data.shape[1], 'columns')

print(data.info())
print(data.describe())


The data has the shape of 1034 Rows and 9 columns
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1034 entries, 0 to 1033
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Game           1034 non-null   object 
 1   Year           825 non-null    float64
 2   Genre          1034 non-null   object 
 3   Publisher      825 non-null    object 
 4   North America  1034 non-null   float64
 5   Europe         1034 non-null   float64
 6   Japan          1034 non-null   float64
 7   Rest of World  1034 non-null   float64
 8   Global         1034 non-null   float64
dtypes: float64(6), object(3)
memory usage: 72.8+ KB
None
              Year  North America       Europe        Japan  Rest of World  \
count   825.000000    1034.000000  1034.000000  1034.000000    1034.000000   
mean   2015.966061       0.204613     0.248714     0.033636       0.089014   
std       1.298360       0.563471     0.785491     0.108344       0.

The "head" function displays the first 5 rows of the data frame.

In [None]:
print(data.head())

In [None]:
data.tail(10)

In [None]:
data.hist(figsize=(15,15))
plt.show()
print(data.shape)
print(data.columns)

### Pair Plot

Plot pairwise relationships in a dataset. By default, this function creates a grid of Axes such that each numeric variable in data is shared across the y-axes of a single row and the x-axes of a single column. The diagonal plots are treated differently: a univariate distribution plot is drawn to show the marginal distribution of the data in each column. It is also possible to show a subset of variables or plot different variables on the rows and columns.

In [None]:
sns.pairplot(data)

### Correlation

Strength of the relationship between two variables.

In [None]:
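# numeric_only=True skips text columns such as Game and Genre (needed in pandas 2.0 and later)
data.corr(numeric_only=True)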

In [None]:
f,ax = plt.subplots(figsize=(18, 18))
sns.heatmap(data.corr(numeric_only=True), annot=True, linewidths=.5, fmt= '.1f',ax=ax)  # numeric columns only
plt.show()

The correlation matrix holds a lot of numbers, each ranging from -1 to 1. A value of 1 means two variables are perfectly positively correlated, 0 means there is no linear correlation between them, and -1 means they are perfectly negatively correlated. Values in between work the same way: a coefficient of -0.3, for example, is far from -1, but its negative sign still indicates a negative correlation.
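
To make the meaning of the coefficient concrete, here is a minimal sketch that computes the Pearson correlation by hand. The arrays `x`, `y_pos`, `y_neg` and the helper `pearson` are hypothetical toy names, not part of the PS4 dataset or the notebook.

```python
import numpy as np

# Toy data (hypothetical, not taken from the PS4 dataset)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pos = 2 * x + 1      # rises whenever x rises
y_neg = -3 * x + 10    # falls whenever x rises

def pearson(a, b):
    """Pearson correlation: covariance scaled by both standard deviations."""
    a_c = a - a.mean()
    b_c = b - b.mean()
    return float((a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum()))

r_pos = pearson(x, y_pos)   # close to +1: perfect positive relationship
r_neg = pearson(x, y_neg)   # close to -1: perfect negative relationship
print(round(r_pos, 3), round(r_neg, 3))
```

The same numbers should come out of `np.corrcoef`, which is a quick way to cross-check the hand-written formula.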

### 3. Missing Values Distribution:

DataFrame.isna() detects missing values. It returns a boolean object of the same size indicating whether each value is NA. NA values, such as None or numpy NaN, are mapped to True. Everything else is mapped to False.

In [None]:
plt.figure(figsize=(20,6))

na = pd.DataFrame(data.isna().sum())

sns.barplot(y=na[0], x=na.index)
plt.title('Missing Values Distribution', size = 20, weight='bold')
print(colored("Missing values column wise -", 'magenta'))
print(colored(data.isna().sum(), 'magenta'))

plt.show()
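
As a small illustration of how `isna()` behaves, the sketch below runs it on a hypothetical three-row frame (`toy` and its values are made up for the example). Note that a sales value of 0.0 is a real value, not a missing one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing Year and one missing Publisher
toy = pd.DataFrame({
    "Year": [2015.0, np.nan, 2017.0],
    "Publisher": ["A", None, "B"],
    "Global": [1.2, 0.4, 0.0],   # 0.0 is a valid number, not NA
})

mask = toy.isna()                  # same shape as toy, True where a value is missing
missing_per_column = mask.sum()    # True counts as 1, so this counts NAs per column

print(mask)
print(missing_per_column)
```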

### 4. Calculate the % of missing data from each column

In [None]:
nan_ratio = []
for col in data.columns:
    nan_item = []
    nan_item.append(col)
    nan_item.append(data[col].isnull().sum())
    nan_item.append(str(round(100 * data[col].isnull().sum() / data.shape[0], 2)) + '%')
    nan_ratio.append(nan_item)

df_nan = pd.DataFrame(nan_ratio, columns=["Column", "NaN count", "NaN ratio"]).set_index("Column")
# Sort by the numeric count: the "NaN ratio" column holds strings, which would sort alphabetically
df_nan = df_nan.sort_values("NaN count", ascending=False)
print(df_nan.astype(object).T)
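
The same percentages can also be obtained without an explicit loop. This is a sketch under the assumption that a numeric percentage column is acceptable; the frame `toy` and the column name "NaN ratio (%)" are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: half of the Year values are missing
toy = pd.DataFrame({
    "Year": [2015.0, np.nan, 2017.0, np.nan],
    "Genre": ["Action", "Sports", "Action", "Racing"],
})

# The mean of a boolean mask is the fraction of True values, i.e. the share of NAs
nan_pct = (toy.isna().mean() * 100).round(2)

summary = pd.DataFrame({
    "NaN count": toy.isna().sum(),
    "NaN ratio (%)": nan_pct,
}).sort_values("NaN ratio (%)", ascending=False)

print(summary)
```

Keeping the ratio numeric means the sort is by value rather than by text.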

In [None]:
from sklearn.impute import SimpleImputer
data = data.replace('N.V.', np.nan)

# Fill the gaps in Year and Publisher with each column's most frequent value
s_imp = SimpleImputer(strategy = 'most_frequent').fit(data[['Year', 'Publisher']])
data[['Year', 'Publisher']] = s_imp.transform(data[['Year', 'Publisher']])


sns.heatmap(data.corr(numeric_only=True), annot=True, cmap='coolwarm')

In [None]:
data['Year'] = data['Year'].astype('int64')

In [None]:
print(data.isnull().sum().sort_values())
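
To show what most-frequent imputation does, here is a pandas-only sketch of the same idea. It uses `mode()` and `fillna()` instead of `SimpleImputer`, and the frame `toy` with its publisher names is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one gap in each column
toy = pd.DataFrame({
    "Year": [2016.0, 2016.0, np.nan, 2018.0],
    "Publisher": ["Pub A", None, "Pub A", "Pub B"],
})

filled = toy.copy()
for col in ["Year", "Publisher"]:
    most_common = filled[col].mode().iloc[0]       # mode() ignores missing values
    filled[col] = filled[col].fillna(most_common)  # replace gaps with that value

# With no gaps left, Year can safely become an integer column
filled["Year"] = filled["Year"].astype("int64")
print(filled)
```

One consequence worth keeping in mind: every filled row is credited to the most common year and publisher, which inflates their counts in later plots.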

### 5. Which genre is the most popular?

Seaborn's countplot shows the number of observations in each categorical bin using bars.

In [None]:
plt.figure(figsize=(13, 8))

sns.countplot(data=data, x='Genre', order=data['Genre'].value_counts().keys(), palette="Set2").set_xlim(0, 9)

plt.show()
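
The bar heights in a count plot are simply category counts. The sketch below reproduces them with `value_counts()` on a hypothetical `genres` series, which is also a direct way to read off the most common genre.

```python
import pandas as pd

# Hypothetical toy genre column
genres = pd.Series(
    ["Action", "Sports", "Action", "Racing", "Action", "Sports"],
    name="Genre",
)

counts = genres.value_counts()   # sorted from most to least frequent
most_popular = counts.idxmax()   # label of the tallest bar

print(counts)
print(most_popular)
```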

In [None]:
plt.figure(figsize=(13, 8))

sns.countplot(data=data, x='Year', order=data['Year'].value_counts().keys(), palette="Set2").set(title='Count by Year')

plt.show()

In [None]:
x = data.Genre.value_counts(sort=False)

fig, ax = plt.subplots(figsize=(10, 10))

# Take the labels from the counts themselves so each wedge is guaranteed to match its genre
ax.pie(x, labels = x.index, autopct='%.1f%%', pctdistance=.75, startangle=24,
       textprops={"fontsize":12}, wedgeprops={'edgecolor':'#383838'})

ax.set_title('Genre by all publisher', fontdict={'fontsize':14})
ax.legend([], bbox_to_anchor=(.8, .77))

fig = plt.gcf()

plt.tight_layout()
plt.show()

In [None]:
print(data['Game'].value_counts().head(5))

nunique() counts the number of distinct values in the column.

In [None]:
print(data['Game'].nunique())

### 6. Publisher

The unique() function returns the distinct values of a column, while nunique(), used here, returns how many distinct values there are.

In [None]:
print(data.Publisher.nunique())
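
A short sketch of the difference between the two calls, on a hypothetical `publishers` series. By default `nunique()` leaves missing values out of the count, while `unique()` keeps them in the returned array.

```python
import pandas as pd

# Hypothetical toy publisher column with one missing entry
publishers = pd.Series(["Pub A", "Pub B", "Pub A", None, "Pub C"])

distinct_values = publishers.unique()    # array of distinct entries, missing value included
distinct_count = publishers.nunique()    # number of distinct entries, missing value excluded

print(distinct_values)
print(distinct_count)
```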


In [None]:
x = data['Publisher'].value_counts().head()
print(x)

In [None]:


fig, ax = plt.subplots(figsize=(6, 6))

# Use the index of the counts as labels so names and wedges cannot get out of step
ax.pie(x, labels = x.index, autopct='%.1f%%', pctdistance=.75, startangle=24,
       textprops={"fontsize":12}, wedgeprops={'edgecolor':'#383838'})

ax.set_title('Best Seller Publisher in %', fontdict={'fontsize':14})
ax.legend([], bbox_to_anchor=(.8, .77))
centre_circle = plt.Circle((0,0),0.55,fc='white', ec='#383838')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(13, 8))
publisher = data['Publisher'].value_counts(ascending=False).head()
# Recent seaborn versions require x and y to be passed as keywords
sns.barplot(x=publisher.index, y=publisher.values).set(title='Best Seller Publisher')
plt.show()



### 7. Namco Bandai Games:

In [None]:
games = data[data["Publisher"] == "Namco Bandai Games"]
print(games.head())


Missing value check:

In [None]:
print(games.isnull().sum().sort_values())

### What is the best selling genre by Namco Bandai Games?

Creating a pie plot:

In [None]:
print(games['Genre'].value_counts().head())

In [None]:
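# Genre labels for the pie chart below (at this point x still holds the publisher counts)
print(games['Genre'].value_counts().head().keys())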

In [None]:
x = games['Genre'].value_counts().head()

fig, ax = plt.subplots(figsize=(8, 8))

ax.pie(x, labels = x.keys(), autopct='%.1f%%', pctdistance=.75, startangle=24,
       textprops={"fontsize":12}, wedgeprops={'edgecolor':'#383838'})

ax.set_title('The best-selling genre by Namco Bandai Games', fontdict={'fontsize':14})
ax.legend([], bbox_to_anchor=(.8, .77))

fig = plt.gcf()

plt.tight_layout()
plt.show()

In [None]:
print(games['Year'].value_counts().head())

### How did total sales change in North America, Europe, Japan, and the rest of the world from 2013 to 2020?

Finding maximum sales values:

In [None]:
n1 = np.around( data['North America'].max(), 2)
print('Max Sales Value in North America is : ', n1)

n2 = np.around( data['Europe'].max(), 2)
print('Max Sales Value in Europe is : ', n2)

n3 = np.around( data['Japan'].max(), 2)
print('Max Sales Value in Japan is : ', n3)

n4 = np.around( data['Rest of World'].max(), 2)
print('Max Sales Value in Rest of World is : ', n4)


In [None]:
n = n1, n2, n3, n4
print(n)

In [None]:
fig, ax = plt.subplots(figsize=(6, 6))

ax.pie(n, labels = ('North America', 'Europe ', 'Japan ', 'Rest of World'), autopct='%.1f%%', pctdistance=.75, startangle=24,
       textprops={"fontsize":12}, wedgeprops={'edgecolor':'#383838'})

ax.set_title('The best market', fontdict={'fontsize':14})
ax.legend([], bbox_to_anchor=(.8, .77))
centre_circle = plt.Circle((0,0),0.55,fc='white', ec='#383838')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
plt.tight_layout()
plt.show()

Market comparison:

In [None]:
plt.style.use('ggplot')
fig, ax = plt.subplots(figsize=(20,6))

x = data.groupby(by='Year')['Europe'].sum()
y = data.groupby(by='Year')['Japan'].sum()
z = data.groupby(by='Year')['North America'].sum()
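ax.plot(x, label='Europe')
ax.plot(y, label='Japan')
ax.plot(z, label='North America')
ax.set_xlabel('Year')
ax.set_ylabel('Sales')
ax.legend()  # identify which line belongs to which region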
plt.show()
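
The yearly totals behind these lines can be computed for several regions in one step. A minimal sketch on a hypothetical frame `toy`, with made-up sales figures:

```python
import pandas as pd

# Hypothetical toy sales table: two games per year
toy = pd.DataFrame({
    "Year": [2015, 2015, 2016, 2016],
    "Europe": [1.0, 0.5, 2.0, 0.25],
    "Japan": [0.1, 0.2, 0.3, 0.4],
})

# Selecting a list of columns after groupby sums every region at once,
# giving one row per year and one column per region
yearly = toy.groupby("Year")[["Europe", "Japan"]].sum()
print(yearly)
```

Calling `yearly.plot()` on such a frame draws one line per region with a legend already attached, which is an alternative to plotting each series separately.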




In [None]:


fig, ax = plt.subplots(nrows= 2, ncols=2, figsize=(14, 15), sharex=False)
ax[0, 0].plot(data.groupby('Year').Europe.sum())
ax[0, 0].set_title('Sales in Europe')

ax[0, 1].plot(data.groupby('Year').Japan.sum())
ax[0, 1].set_title('Sales in Japan')

ax[1, 0].plot(data.groupby('Year')['North America'].sum())
ax[1, 0].set_title('Sales in North America')

ax[1, 1].plot(data.groupby('Year')['Rest of World'].sum())
ax[1, 1].set_title('Sales in Rest of World')
