# Exploratory Data Analysis of Games from 1970 to 2016

This Kaggle dataset contains games released from 1970 to 2016. For each game, it provides the following:
1. the **Release Date** of the game
2. the **Genre** of the game
3. which **Platform** the game supports
4. the **Score** of the game
5. the **Title (Name)** of the game

Using this data, this exploratory data analysis tries to answer the following questions:
1. When is the best time to release a game?
2. What platforms are popular?
3. What kinds of genre are popular?
4. What are the main topics of the popular games?
5. Others

# Import Packages

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import warnings
import wordcloud 
import scipy.misc
warnings.filterwarnings('ignore')
%matplotlib inline

ModuleNotFoundError: No module named 'wordcloud'

# Import Data

In [None]:
path = os.getcwd()
data = pd.read_csv(path + '/Data/ign.csv')

In [None]:
data.info()

We can see that there are NAs in the __genre__ column, so the next step is to remove them.

In [None]:
data_c = data.dropna()
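
As a rough check before dropping rows, the number of missing values per column can be counted directly. The sketch below uses a small made-up table (`toy` is a hypothetical stand-in, not the IGN data) and also shows `dropna(subset=...)`, which removes only the rows where a chosen column is missing.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the IGN data (values are made up for illustration).
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'genre': ['Action', np.nan, 'Puzzle', np.nan],
    'score': [8.0, 7.5, 9.0, 6.0],
})

# Count missing values per column before deciding what to drop.
na_counts = toy.isna().sum()
print(na_counts)

# Drop only the rows where 'genre' is missing; other columns are not checked.
toy_clean = toy.dropna(subset=['genre'])
print(len(toy), '->', len(toy_clean))
```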

In [None]:
data_c.info()

In [None]:
data_c.head()

In [None]:
data_c.describe()

# Question 1: When is the best time to release a game?

To answer this question, let's first look at the distribution of the release year.

In [None]:
# Pass the column as x explicitly; recent seaborn versions expect keyword arguments here
sns.countplot(x = data_c['release_year'],
              color = 'blue')
plt.title('Number of games released in each year')
plt.xticks(rotation = 90);

Here we can see that the data points in 1970 are outliers, so we should remove them from our analysis.

In [None]:
# Take a copy so that new columns can be added later without a SettingWithCopyWarning
data_cc = data_c[data_c['release_year'] > 1970].copy()
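
One way to confirm that an early year really is isolated is to tabulate the releases per year and look at the gap to the next populated year. This is only a sketch on made-up values; `toy` and its contents are assumptions for illustration.

```python
import pandas as pd

# Made-up release years: one isolated early entry, then a cluster of later ones.
toy = pd.DataFrame({'release_year': [1970, 1996, 1996, 2008, 2008, 2008, 2016]})

# Number of games per year, ordered chronologically.
year_counts = toy['release_year'].value_counts().sort_index()
print(year_counts)

# Gap between the earliest year and the next year that has any release.
gap = year_counts.index[1] - year_counts.index[0]
print('gap after first year:', gap)

# Same filter as in the notebook: keep everything after 1970.
toy_filtered = toy[toy['release_year'] > 1970].copy()
```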

In [None]:
data_cc.describe()

In [None]:
sns.countplot(x = data_cc['release_year'],
              color = 'blue')
plt.title('Number of games released in each year')
plt.xticks(rotation = 90);

With the outliers removed, the number of games increases from 1996, peaks in 2008, and then declines through 2016.

Then, let's look at the distribution of the release month.

In [None]:
sns.countplot(x = data_cc['release_month'],
              color = 'blue')
plt.title('Number of games released in each month');

Here we can see that game makers prefer to release their games toward the end of the year. What about the day of the week?

In [None]:
data_cc['Date'] = pd.to_datetime(data_cc['release_year'].astype(str) + '-' + data_cc['release_month'].astype(str) + '-' + data_cc['release_day'].astype(str))

In [None]:
# .dt.weekday_name was removed in pandas 1.0; .dt.day_name() returns the same day names
data_cc['Weekday'] = data_cc['Date'].dt.day_name()
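
As an alternative to string concatenation, `pd.to_datetime` can also assemble dates from a frame whose columns are named `year`, `month` and `day`. The sketch below assumes the three release columns hold integers and uses made-up dates in a hypothetical `toy` frame.

```python
import pandas as pd

# Made-up release dates split into separate integer columns.
toy = pd.DataFrame({
    'release_year': [2000, 2008, 2016],
    'release_month': [1, 11, 1],
    'release_day': [1, 11, 1],
})

# pd.to_datetime assembles dates from columns named year / month / day.
parts = toy.rename(columns={'release_year': 'year',
                            'release_month': 'month',
                            'release_day': 'day'})
toy['Date'] = pd.to_datetime(parts[['year', 'month', 'day']])

# Name of the weekday for each date.
toy['Weekday'] = toy['Date'].dt.day_name()
print(toy[['Date', 'Weekday']])
```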

In [None]:
sns.countplot(x = data_cc['Weekday'],
              color = 'blue',
              order = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'])

plt.title('Number of games released in each weekday')
plt.xticks(rotation = 20);

Interestingly, game makers prefer to release their games on workdays.

__Brief Summary__:  
2008 has the most releases, and game makers tend to release their games at the end of the year and on workdays.

We have explored the relationship between the number of games and time. What about the relationship between feedback (score) and time?  
First, let's look at how the scores are distributed.

In [None]:
# sns.distplot is deprecated; sns.histplot draws the same count histogram
sns.histplot(x = data_cc['score'],
             bins = 10,
             edgecolor = 'k')
plt.ylabel('count')
plt.title('Distribution of scores');

The distribution of scores is left-skewed, with most scores clustered around 8, which means most games have a relatively high score. Next, let's find out whether there is a good time for game makers to release a game in order to earn a high score.
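
The skew can also be checked numerically rather than only by eye. In this sketch (made-up scores, not the IGN data), a negative sample skewness and a mean below the median are both consistent with a left-skewed distribution.

```python
import pandas as pd

# Made-up scores with a long tail toward low values.
scores = pd.Series([2.0, 5.0, 7.0, 8.0, 8.0, 8.5, 9.0, 9.0, 9.5])

skewness = scores.skew()   # sample skewness; negative means left-skewed
mean = scores.mean()
median = scores.median()

print('skewness:', round(skewness, 2))
print('mean:', round(mean, 2), 'median:', median)
```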

In [None]:
year_score = pd.DataFrame(data_cc.groupby('release_year')['score'].mean()).reset_index()
plt.figure(figsize=(22,10))
plt.subplot(2,2,1)
sns.barplot(x = 'release_year',
            y = 'score',
            data = year_score,
            color = 'blue')
plt.xticks(rotation = 35, size = 20)
plt.xlabel('Release Year', size = 20)
plt.yticks(size = 20)
plt.ylabel('Score', size = 20)
plt.title('Year VS Score', size = 20)
plt.subplot(2,2,2)
month_score = pd.DataFrame(data_cc.groupby('release_month')['score'].mean()).reset_index()
sns.barplot(x = 'release_month',
            y = 'score',
            data = month_score,
            color = 'blue')
plt.xticks(size = 20)
plt.xlabel('Release Month', size = 20)
plt.yticks(size = 20)
plt.ylabel('Score', size = 20)
plt.title('Month VS Score', size = 20)
plt.subplot(2,1,2)
wday_score = pd.DataFrame(data_cc.groupby('Weekday')['score'].mean()).reset_index()
sns.barplot(x = 'Weekday',
            y = 'score',
            data = wday_score,
            color = 'blue',
           order = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'])
plt.xticks(size = 20)
plt.xlabel('Release Weekday', size = 20)
plt.yticks(size = 20)
plt.ylabel('Score', size = 20)
plt.title('Weekday VS Score', size = 20)
plt.tight_layout()

Scores do not differ much across time periods, so release time is probably not a main factor influencing a game's score. Still, games released in September or on a Sunday have a relatively high average score.

# Question 2: What platforms are popular?

For this part, I am going to do the analysis from two aspects: from the view of __game makers__ and from the view of __players__

## For game makers

If a platform has many games released on it, I consider it a popular platform for game makers.

In [None]:
platform_makers = pd.DataFrame(data_cc.groupby('platform')['title'].count()).sort_values(by = 'title', ascending=False).head(20).reset_index()
platform_makers = platform_makers.rename(columns={'title':'count'})
sns.barplot(y = 'platform',
            x = 'count',
            data = platform_makers,
            color = 'blue')
plt.title('TOP 20 popular platforms for game makers');

In [None]:
platform_makers['percentage'] = (round(platform_makers['count'] * 100 / 18589.0, 2)).astype(str) + '%'
platform_makers

Plotting the top 20 platforms for game makers shows that PC is the most popular: 18.11% of the games released from 1970 to 2016 support the PC platform.
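
The percentage column can also be derived without a hard-coded row count. A minimal sketch on made-up data, using `value_counts(normalize=True)` so the denominator always matches the frame being analysed (`toy` is a hypothetical stand-in):

```python
import pandas as pd

# Made-up platform column; the real notebook would use data_cc['platform'].
toy = pd.DataFrame({'platform': ['PC', 'PC', 'PC', 'Wii', 'Wii',
                                 'Xbox 360', 'PlayStation 2', 'PC']})

# Share of games per platform, in percent, largest first.
share = toy['platform'].value_counts(normalize=True).mul(100).round(2)
print(share)
```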

## For players

An important indicator of which games are popular among players is the score they give. If a platform has many highly rated games released on it, I consider it a popular platform for players.

In [None]:
platform_players = pd.DataFrame(data_cc.groupby('platform')['score'].mean()).sort_values(by = 'score', ascending=False).head(20).reset_index()
sns.barplot(y = 'platform',
            x = 'score',
            data = platform_players,
            color = 'blue')
plt.title('TOP 20 popular platforms for players');

* For players the picture is different: SteamOS has the highest average score of games released on it, while PC is not in the top 20.
* At the same time, few games have been released on SteamOS.
* Platforms that appear on both top 20 lists: PlayStation 4, Xbox One, Nintendo 3DS.
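
Because an average over very few games is unstable, it may help to look at the count and the mean together and to rank only platforms with enough games. The sketch below uses made-up scores, and the threshold `MIN_GAMES` is a hypothetical choice, not a value from the analysis.

```python
import pandas as pd

# Made-up scores: one platform has a single, very highly rated game.
toy = pd.DataFrame({
    'platform': ['PC'] * 4 + ['SteamOS'] + ['PlayStation 4'] * 3,
    'score': [7.0, 8.0, 6.0, 7.0, 9.5, 8.0, 8.5, 7.5],
})

# Number of games and average score per platform.
stats = toy.groupby('platform')['score'].agg(['count', 'mean'])

# Hypothetical minimum number of games required before ranking a platform.
MIN_GAMES = 3
reliable = stats[stats['count'] >= MIN_GAMES].sort_values('mean', ascending=False)

print(stats)
print(reliable)
```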

# Question 3: What kinds of genre are popular?

Similarly, I am going to separate this question into two parts: the __game makers'__ side and the __players'__ side.

## For game makers

In [None]:
platform_makers = pd.DataFrame(data_cc.groupby('genre')['title'].count()).sort_values(by = 'title', ascending=False).head(20).reset_index()
platform_makers = platform_makers.rename(columns={'title':'count'})
sns.barplot(y = 'genre',
            x = 'count',
            data = platform_makers,
            color = 'blue')
plt.title('TOP 20 popular genres for game makers');

In [None]:
platform_makers['percentage'] = (round(platform_makers['count'] * 100 / 18589.0, 2)).astype(str) + '%'
platform_makers

Action is the most popular genre among game makers: 20% of the games are action games, roughly twice as many as the second most common genre. Action, Sports, Shooter, Racing and Adventure games together account for 50% of the games made from 1970 to 2016.
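
A claim of the form "the top few genres cover half of all games" can be checked with a cumulative share. This is a sketch on made-up genre labels; the 50% target mirrors the statement above, and everything else is assumed for illustration.

```python
import pandas as pd

# Made-up genre column with distinct counts per genre.
toy = pd.DataFrame({'genre': ['Action'] * 4 + ['Sports'] * 3 +
                             ['Shooter'] * 2 + ['Puzzle']})

# Percentage share per genre (largest first) and its running total.
share = toy['genre'].value_counts(normalize=True) * 100
cumulative = share.cumsum()
print(cumulative)

# Smallest number of top genres whose combined share reaches 50%.
n_needed = int((cumulative < 50).sum()) + 1
print('genres needed for 50%:', n_needed)
```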

## For players

In [None]:
platform_makers = pd.DataFrame(data_cc.groupby('genre')['score'].mean()).sort_values(by = 'score', ascending=False).head(20).reset_index()
sns.barplot(y = 'genre',
            x = 'score',
            data = platform_makers,
            color = 'blue')
plt.title('TOP 20 popular genres for players');

Compilation has the highest average score among all genres (this may be because game makers bundle only popular games into a compilation).

# Question 4: What are the titles of the popular games?

Here we select the top 50% of games by score for analysis.

In [None]:
top_score = data_cc.sort_values(by = 'score', ascending = False).head(int(data_cc.shape[0]/2))

In [None]:
long_string = ' '.join(top_score['title'].tolist())
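# scipy.misc.imread is no longer available in recent SciPy releases; plt.imread reads the mask image instead
im = plt.imread(path + '/Data/PS copy.jpg')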
wordclouds = wordcloud.WordCloud(stopwords=wordcloud.STOPWORDS, 
                                 mask = im, 
                                 background_color='white',
                                 scale = .5)
wordclouds.generate(long_string)
wordclouds.to_image()

Above is the word cloud of the titles of the top 50% of games. Here we can see __Call Duty__, __Final Fantasy__, __Tom Clancy__ and others.
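
If the `wordcloud` package is not installed, a plain word count gives a rough text version of the same picture. The titles and the stop-word set below are made up for illustration; the real analysis would use `top_score['title']` and `wordcloud.STOPWORDS`.

```python
from collections import Counter

# Made-up titles standing in for the top-scored games.
titles = ['Call of Duty 4', 'Call of Duty: Black Ops',
          'Final Fantasy X', 'Final Fantasy VII']

# Tiny hypothetical stop-word list.
stopwords = {'of', 'the', 'a'}

# Split titles into lowercase words, strip trailing colons, drop stop words.
words = [w.strip(':').lower() for t in titles for w in t.split()]
counts = Counter(w for w in words if w not in stopwords)

print(counts.most_common(4))
```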

# Question 5: Others

In this section, I am going to combine the variables above to see whether anything interesting emerges that cannot be found by analyzing one variable alone.

## Year/Month and Genre

In [None]:
genres_count = pd.DataFrame(data_cc.groupby(by = 'genre')['title'].count()).sort_values(by = 'title', ascending = False)[:20].reset_index()
top_genres = data_cc[data_cc['genre'].isin(genres_count['genre'])]
year_genre = top_genres.groupby(['release_year', 'genre'])['title'].count().reset_index().sort_values(by = 'title')
year_genre = pd.pivot_table(year_genre,
                            values = 'title', 
                            index = 'release_year', 
                            columns = 'genre')
plt.figure(figsize=(15,7))
sns.heatmap(data = year_genre, 
            annot=True, 
            fmt='3.0f', 
            cmap="YlGnBu", linewidth = .5)
plt.title('Year VS Genre');

Here we can see that most genres reach their peak number of releases in 2008, but __Racing__ and __Action__ games peaked in 2000.
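
Rather than reading peaks off the heatmap, the peak year of each genre can be computed from the year-by-genre count table with `idxmax`. A small sketch with made-up rows (`toy` is hypothetical):

```python
import pandas as pd

# Made-up releases; each row is one game.
toy = pd.DataFrame({
    'release_year': [2000, 2000, 2000, 2008, 2008, 2008, 2008, 2008],
    'genre': ['Racing', 'Racing', 'Action', 'Action', 'Action',
              'Racing', 'Puzzle', 'Puzzle'],
})

# Year-by-genre table of release counts (missing combinations become 0).
counts = toy.groupby(['release_year', 'genre']).size().unstack(fill_value=0)

# For each genre (column), the year (row label) with the most releases.
peak_year = counts.idxmax()
print(peak_year)
```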

In [None]:
year_genre = top_genres.groupby(['release_month', 'genre'])['title'].count().reset_index().sort_values(by = 'title')
year_genre = pd.pivot_table(year_genre,
                            values = 'title', 
                            index = 'release_month', 
                            columns = 'genre')
plt.figure(figsize=(15,7))
sns.heatmap(data = year_genre, 
            annot=True, 
            fmt='3.0f', 
            cmap="YlGnBu", linewidth = .5)
plt.title('Month VS Genre');

For month, there is no significant difference from the earlier analysis: most games are released at the end of the year.

## Year VS Platform

In [None]:
genres_count = pd.DataFrame(data_cc.groupby(by = 'platform')['title'].count()).sort_values(by = 'title', ascending = False).reset_index()[:20]
top_genres = data_cc[data_cc['platform'].isin(genres_count['platform'])]
year_genre = top_genres.groupby(['release_year', 'platform'])['title'].count().reset_index().sort_values(by = 'title')
year_genre = pd.pivot_table(year_genre,
                            values = 'title', 
                            index = 'release_year', 
                            columns = 'platform')
plt.figure(figsize=(15,7))
sns.heatmap(data = year_genre, 
            annot=True, 
            fmt='3.0f', 
            cmap="YlGnBu", linewidth = .5)
plt.title('Year VS Platform');

It is no surprise that there are many blanks in the heatmap above, because some platforms appeared only recently and others disappeared over time. The PC platform has lasted a very long time, but the number of games released on it has been decreasing since 2008. The PlayStation series follows a similar life-cycle pattern: the number of games released on a console is small at the beginning and end of its product life cycle and peaks in the middle, so the distribution is bell-shaped. Based on this pattern, we can predict that more games will be released on PS4 after 2016. The same holds for the Xbox series.
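
The life-cycle reading can be summarised per platform as a first year, last year and peak year of releases. The sketch below runs on made-up rows, and the platform names and years are illustrative assumptions only.

```python
import pandas as pd

# Made-up releases for two console generations.
toy = pd.DataFrame({
    'platform': ['PS3'] * 6 + ['PS4'] * 3,
    'release_year': [2006, 2008, 2008, 2008, 2010, 2013,
                     2013, 2014, 2014],
})

grouped = toy.groupby('platform')['release_year']

# First and last year with a release on each platform.
span = grouped.agg(['min', 'max'])

# Year with the most releases on each platform.
peak = grouped.agg(lambda s: s.value_counts().idxmax())

lifecycle = span.assign(peak=peak)
print(lifecycle)
```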