# Exploratory Data Analysis of Video Game Data

In this notebook we explore the data through visualization, fill in any missing values, and examine the features that we hypothesized may correlate with game popularity. We also explore what we will define as a "popular" video game.

In [1]:
import pandas as pd
from pathlib import Path
import seaborn as sns

import requests
import numpy as np
import pandas_profiling
import tkinter
import matplotlib
import matplotlib.pyplot as plt
# Use the Tk backend so plots can open in interactive windows
matplotlib.use('tkagg')

# I. Steam Purchase Data

Note: The Hours column is all 1s. We use this dataframe to measure game popularity by the number of Steam purchases, and later check whether this matches the data from the other data sets.
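A quick way to confirm that a column holds a single constant value is sketched below. The `toy_purchase` frame is hypothetical stand-in data, not the real purchase file; only the `Game` and `Hours` column names are taken from this notebook.

```python
import pandas as pd

# Hypothetical stand-in for the purchase dataframe (not the real data)
toy_purchase = pd.DataFrame({
    "Game": ["Game A", "Game B", "Game A"],
    "Hours": [1.0, 1.0, 1.0],
})

# True only if every value in the column equals 1
all_ones = toy_purchase["Hours"].eq(1).all()

# A constant column has exactly one distinct value
n_unique = toy_purchase["Hours"].nunique()

print(all_ones, n_unique)
```

If the check holds, counting rows per game is equivalent to counting purchases, which is how the column is used in the rest of this section.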

In [2]:
purchase = pd.read_csv("./csv_files/steam_data_purchase_clean.csv", index_col=0)

In [3]:
purchase.head()

In [4]:
purchase.info()

In [5]:
purchase.describe()

In [6]:
purchase['Game'].value_counts()

In [7]:
purchase['Game'].value_counts().describe()

In [8]:
len(pd.unique(purchase['Game']))

There are 5155 different games listed in this data set, with a mean of 25 sales and a standard deviation of 102 sales per game.

Next I want to look at the distribution of these sales to decide on a good baseline for determining whether a game is popular.

In [9]:
pivoted_purchase = purchase.pivot_table(index='Game', values='Hours', aggfunc='count')
pivoted_purchase

In [10]:
pivoted_purchase.describe()

The summary above confirms that the pivot table gives the same numbers as the value counts from the original dataframe.
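Rather than comparing the two `.describe()` outputs by eye, the agreement can also be checked directly. The sketch below uses a small hypothetical `toy` frame with the same `Game` and `Hours` columns; it is an illustration, not a cell from the original analysis.

```python
import pandas as pd

# Hypothetical toy data standing in for the purchase dataframe
toy = pd.DataFrame({
    "Game": ["A", "A", "B", "C", "C", "C"],
    "Hours": [1, 1, 1, 1, 1, 1],
})

# Purchases per game via the pivot table, as in the cell above
counts_from_pivot = toy.pivot_table(
    index="Game", values="Hours", aggfunc="count"
)["Hours"].sort_index()

# Purchases per game via value_counts, sorted by game name to align
counts_from_value_counts = toy["Game"].value_counts().sort_index()

# Both approaches should yield identical counts for every game
same = (counts_from_pivot.to_numpy() == counts_from_value_counts.to_numpy()).all()
print(same)
```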

In [11]:
%matplotlib inline
pivoted_purchase.plot(kind='hist', bins=70, figsize=[12,6])
plt.show()

In [12]:
pivoted_purchase[pivoted_purchase['Hours'] > 100].plot(kind='hist', bins=70, figsize=[12,6])
plt.show()

In [13]:
pivoted_purchase[pivoted_purchase['Hours'] > 1000].plot(kind='hist', bins=70, figsize=[12,6])
plt.show()

In [14]:
pivoted_purchase[pivoted_purchase['Hours'] < 100].plot(kind='hist', bins=10, figsize=[12,6])
plt.show()

Our sales data appears to be heavily skewed. I don't believe this data set is a good determinant of popularity, as the analysis suggests that a game could count as popular once it reaches the mean of 25 sales.
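To illustrate why the mean is a weak baseline on right-skewed counts, the sketch below compares the mean, the median, and an upper quantile on hypothetical per-game purchase counts. The numbers and the 90th-percentile cutoff are assumptions chosen for illustration, not values from this data set.

```python
import pandas as pd

# Hypothetical per-game purchase counts, heavily right-skewed
toy_counts = pd.Series([1, 1, 1, 2, 2, 3, 5, 8, 40, 900], name="purchases")

mean_count = toy_counts.mean()      # pulled upward by the single large value
median_count = toy_counts.median()  # robust to the outlier
p90 = toy_counts.quantile(0.90)     # assumed cutoff for "popular"

# Games at or above the assumed 90th-percentile cutoff
popular = toy_counts[toy_counts >= p90]

print(mean_count, median_count, p90, len(popular))
```

On data like this, a single blockbuster drags the mean far above what a typical game sells, so a quantile-based cutoff may describe "popular" more faithfully than the mean does.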

# II. Steam Play Data

In this data set we look at the hours each game has been played to see whether that is a better determinant of game popularity.

In [15]:
play = pd.read_csv("./csv_files/steam_data_play_clean.csv", index_col=0)
play.head()

In [16]:
play.info()

In [17]:
play.describe()

In [18]:
play['Game'].value_counts()

In the first application of .describe() we analyze each record of hours played separately. The distribution is clearly skewed: the mean is 48.9 hours versus a median of 4.5 hours. Next I would like to see whether this remains true when we take the average hours played, grouped by game name.
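One way to look at skew within each game is to compute the per-game mean and median side by side. The sketch below does this on a hypothetical `toy_play` frame; the values are invented for illustration and only the `Game` and `Hours` column names come from this notebook.

```python
import pandas as pd

# Hypothetical play records: game A has one very long session
toy_play = pd.DataFrame({
    "Game": ["A", "A", "A", "B", "B", "C"],
    "Hours": [0.5, 1.0, 100.0, 2.0, 4.0, 10.0],
})

# Mean and median hours per game in a single table
per_game = toy_play.groupby("Game")["Hours"].agg(["mean", "median"])

# A large gap between the two columns signals skew within that game
per_game["gap"] = per_game["mean"] - per_game["median"]
print(per_game)
```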

In [19]:
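# numeric_only=True skips non-numeric columns, which newer pandas versions refuse to average
avg_hr_play = play.groupby('Game').mean(numeric_only=True)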
avg_hr_play.head()

In [20]:
avg_hr_play.describe()