# Predicting Video Game Sales
First of all, we need to import some libraries.

In [4]:
# Import libraries necessary for this project
import numpy as np
import pandas as pd

# Allows the use of display() for DataFrames
from IPython.display import display

# Pretty display for notebooks
%matplotlib inline

# Data Schema

## Dataset 1 (Video_Games_Sales_as_at_22_Dec_2016.csv)
This is my main dataset. It will be used to predict video game sales.

| title           | description                                                   | data type |
|-----------------|---------------------------------------------------------------|-----------|
| Name            | Name of the game                                              | String    |
| Platform        | Platform                                                      | String    |
| Year_of_Release | Year of release                                               | Numeric   |
| Genre           | Genre                                                         | String    |
| Publisher       | Publisher                                                     | String    |
| NA_Sales        | Game sales in North America (in millions of units)            | Numeric   |
| EU_Sales        | Game sales in the European Union (in millions of units)       | Numeric   |
| JP_Sales        | Game sales in Japan (in millions of units)                    | Numeric   |
| Other_Sales     | Game sales in the rest of the world (in millions of units)    | Numeric   |
| Global_Sales    | Total sales in the world (in millions of units)               | Numeric   |
| Critic_Score    | Aggregate score compiled by Metacritic staff                  | Numeric   |
| Critic_Count    | Number of critics used to compute Critic_Score                | Numeric   |
| User_Score      | Score by Metacritic's subscribers                             | Numeric   |
| User_Count      | Number of users who contributed to User_Score                 | Numeric   |
| Developer       | Party responsible for creating the game                       | String    |
| Rating          | The [ESRB](https://www.esrb.org/) ratings                     | String    |

## Dataset 2 (ign.csv)
This is my secondary dataset. It might be used to supplement the primary dataset, but merging the two would probably require considerable effort.

| title           | description                                                   | data type |
|-----------------|---------------------------------------------------------------|-----------|
| ID              | ID                                                            | Numeric   |
| score_phrase    | Game rating as a phrase                                       | String    |
| title           | Name of the game                                              | String    |
| url             | URL to source of data                                         | String    |
| platform        | Platform                                                      | String    |
| score           | Game rating as a float                                        | Numeric   |
| genre           | Genre                                                         | String    |
| editors_choice  | Special recommendation of the editor                          | String    |
| release_year    | Year of release                                               | Numeric   |
| release_month   | Month of release                                              | Numeric   |
| release_day     | Day of release                                                | Numeric   |
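
To give an idea of where the merging effort comes from, here is a minimal sketch of joining the two datasets on game name and platform. It uses small in-memory stand-ins instead of the real files. The helper `normalize_key`, the `key_*` columns and the review rows are hypothetical and purely illustrative. It also assumes both datasets use the same platform labels, which may not hold for the real data.

```python
import pandas as pd

# Toy stand-ins for the two datasets; the review rows are illustrative only
sales = pd.DataFrame({
    "Name": ["Wii Sports", "Mario Kart Wii"],
    "Platform": ["Wii", "Wii"],
    "Global_Sales": [82.53, 35.52],
})
reviews = pd.DataFrame({
    "title": [" wii sports ", "Some Other Game"],
    "platform": ["Wii", "PC"],
    "score": [7.5, 6.0],
})


def normalize_key(series):
    """Strip and lower-case a string column so it can serve as a join key."""
    return series.str.strip().str.lower()


sales["key_name"] = normalize_key(sales["Name"])
sales["key_platform"] = normalize_key(sales["Platform"])
reviews["key_name"] = normalize_key(reviews["title"])
reviews["key_platform"] = normalize_key(reviews["platform"])

# A left join keeps every sales row; unmatched rows get NaN in the score column
merged = sales.merge(
    reviews[["key_name", "key_platform", "score"]],
    on=["key_name", "key_platform"],
    how="left",
)
print(merged[["Name", "Platform", "Global_Sales", "score"]])
```

Exact matching after normalization is the simplest option. Titles that differ in punctuation or subtitles would need fuzzier matching, which is where most of the effort would go.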


We can remove several columns in dataset 2 that we don't need:
* **ID**, which is just an ID that is not used in dataset 1
* **score_phrase**, because it is redundant to _score_ and less precise
* **url**, because that is just the origin of the data

The next step will be to load the datasets.
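
Once dataset 2 is loaded, the column removal described above could look roughly like this. The sketch builds a hypothetical one-row DataFrame from the column names in the schema table, so the variable `dataset2` and the row values are assumptions and not the real file contents.

```python
import pandas as pd

# Hypothetical one-row stand-in using the column names from the schema table
dataset2 = pd.DataFrame({
    "ID": [0], "score_phrase": ["Great"], "title": ["Example Game"],
    "url": ["/games/example"], "platform": ["PC"], "score": [8.0],
    "genre": ["Puzzle"], "editors_choice": ["N"],
    "release_year": [2012], "release_month": [9], "release_day": [12],
})

unused_columns = ["ID", "score_phrase", "url"]
# errors="ignore" skips any listed column that is absent from the frame
dataset2 = dataset2.drop(columns=unused_columns, errors="ignore")
print(list(dataset2.columns))
```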

In [5]:
# Load dataset #1
try:
    dataset1 = pd.read_csv("data/Video_Games_Sales_as_at_22_Dec_2016.csv")
    print("Dataset #1 has {} samples with {} features each.".format(*dataset1.shape))
except IOError:
    # Raised when the file is missing or cannot be read
    print("Dataset #1 could not be loaded. Is the dataset missing?")

print('\n')

Dataset #1 has 16719 samples with 16 features each.




In [6]:
# Display a description of the datasets
display(dataset1.dtypes)
display(dataset1.head())
display(dataset1.describe())
print('\n')

Name                object
Platform            object
Year_of_Release    float64
Genre               object
Publisher           object
NA_Sales           float64
EU_Sales           float64
JP_Sales           float64
Other_Sales        float64
Global_Sales       float64
Critic_Score       float64
Critic_Count       float64
User_Score          object
User_Count         float64
Developer           object
Rating              object
dtype: object

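|   | Name | Platform | Year_of_Release | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales | Critic_Score | Critic_Count | User_Score | User_Count | Developer | Rating |
|---|------|----------|-----------------|-------|-----------|----------|----------|----------|-------------|--------------|--------------|--------------|------------|------------|-----------|--------|
| 0 | Wii Sports | Wii | 2006.0 | Sports | Nintendo | 41.36 | 28.96 | 3.77 | 8.45 | 82.53 | 76.0 | 51.0 | 8.0 | 322.0 | Nintendo | E |
| 1 | Super Mario Bros. | NES | 1985.0 | Platform | Nintendo | 29.08 | 3.58 | 6.81 | 0.77 | 40.24 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | Mario Kart Wii | Wii | 2008.0 | Racing | Nintendo | 15.68 | 12.76 | 3.79 | 3.29 | 35.52 | 82.0 | 73.0 | 8.3 | 709.0 | Nintendo | E |
| 3 | Wii Sports Resort | Wii | 2009.0 | Sports | Nintendo | 15.61 | 10.93 | 3.28 | 2.95 | 32.77 | 80.0 | 73.0 | 8.0 | 192.0 | Nintendo | E |
| 4 | Pokemon Red/Pokemon Blue | GB | 1996.0 | Role-Playing | Nintendo | 11.27 | 8.89 | 10.22 | 1.0 | 31.37 | NaN | NaN | NaN | NaN | NaN | NaN |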


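|       | Year_of_Release | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales | Critic_Score | Critic_Count | User_Count |
|-------|-----------------|----------|----------|----------|-------------|--------------|--------------|--------------|------------|
| count | 16450.0 | 16719.0 | 16719.0 | 16719.0 | 16719.0 | 16719.0 | 8137.0 | 8137.0 | 7590.0 |
| mean  | 2006.487356 | 0.26333 | 0.145025 | 0.077602 | 0.047332 | 0.533543 | 68.967679 | 26.360821 | 162.229908 |
| std   | 5.878995 | 0.813514 | 0.503283 | 0.308818 | 0.18671 | 1.547935 | 13.938165 | 18.980495 | 561.282326 |
| min   | 1980.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.01 | 13.0 | 3.0 | 4.0 |
| 25%   | 2003.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.06 | 60.0 | 12.0 | 10.0 |
| 50%   | 2007.0 | 0.08 | 0.02 | 0.0 | 0.01 | 0.17 | 71.0 | 21.0 | 24.0 |
| 75%   | 2010.0 | 0.24 | 0.11 | 0.04 | 0.03 | 0.47 | 79.0 | 36.0 | 81.0 |
| max   | 2020.0 | 41.36 | 28.96 | 10.22 | 10.57 | 82.53 | 98.0 | 113.0 | 10665.0 |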






Next, we clean up all string columns: strip surrounding whitespace and replace the placeholder 'tbd' with NaN.

In [7]:
# Strip leading and trailing whitespace from every string (object) column
for col in dataset1.columns:
    if dataset1[col].dtype == object:
        dataset1[col] = dataset1[col].str.strip()

# Replace the placeholder 'tbd' with NaN
dataset1 = dataset1.replace('tbd', float('NaN'))
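
Replacing 'tbd' with NaN still leaves User_Score stored as strings (dtype object). A minimal sketch of one way to convert such a column to numbers, using a small stand-in Series rather than the real column (the names `scores` and `numeric_scores` are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Small stand-in for the User_Score column as it is read from the CSV
scores = pd.Series(["8.0", "tbd", None, "8.3"], name="User_Score")

# errors="coerce" turns anything that is not a number (such as 'tbd') into NaN
numeric_scores = pd.to_numeric(scores, errors="coerce")
print(numeric_scores.dtype)
print(numeric_scores.tolist())
```

With this approach the separate 'tbd' replacement would not be needed for this column, since unparseable values become NaN in the same step.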

In order to understand the dataset better, we should have a look at possible values.

In [8]:
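print('DATASET 1')
# Show the sorted distinct values of every string (object) column
for col in dataset1.columns:
    if dataset1[col].dtype == object:
        tmp = dataset1[col].unique()
        # Sort NaN first, then the strings; avoids comparing float with str
        tmp = np.array(sorted(tmp, key=lambda v: (isinstance(v, str), v)), dtype=object)
        display(tmp)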

DATASET 1


array([nan, "'98 Koshien", '.hack//G.U. Vol.1//Rebirth', ...,
       'uDraw Studio: Instant Artist', 'wwe Smackdown vs. Raw 2006',
       '¡Shin Chan Flipa en colores!'], dtype=object)

array(['2600', '3DO', '3DS', 'DC', 'DS', 'GB', 'GBA', 'GC', 'GEN', 'GG',
       'N64', 'NES', 'NG', 'PC', 'PCFX', 'PS', 'PS2', 'PS3', 'PS4', 'PSP',
       'PSV', 'SAT', 'SCD', 'SNES', 'TG16', 'WS', 'Wii', 'WiiU', 'X360',
       'XB', 'XOne'], dtype=object)

array([nan, 'Action', 'Adventure', 'Fighting', 'Misc', 'Platform',
       'Puzzle', 'Racing', 'Role-Playing', 'Shooter', 'Simulation',
       'Sports', 'Strategy'], dtype=object)

array([nan, '10TACLE Studios', '1C Company',
       '20th Century Fox Video Games', '2D Boy', '3DO', '49Games',
       '505 Games', '5pb', '7G//AMES', '989 Sports', '989 Studios',
       'AQ Interactive', 'ASC Games', 'ASCII Entertainment',
       'ASCII Media Works', 'ASK', 'Abylight', 'Acclaim Entertainment',
       'Accolade', 'Ackkstudios', 'Acquire', 'Activision',
       'Activision Blizzard', 'Activision Value', 'Adeline Software',
       'Aerosoft', 'Agatsuma Entertainment', 'Agetec', 'Aksys Games',
       'Alawar Entertainment', 'Alchemist', 'Alternative Software',
       'Altron', 'Alvion', 'American Softworks', 'Angel Studios',
       'Answer Software', 'Aqua Plus', 'Aques', 'Arc System Works',
       'Arena Entertainment', 'Aria', 'Arika', 'ArtDink', 'Aruze Corp',
       'Ascaron Entertainment', 'Ascaron Entertainment GmbH', 'Asgard',
       'Asmik Ace Entertainment', 'Asmik Corp', 'Aspyr', 'Astragon',
       'Asylum Entertainment', 'Atari', 'Athena', 'Atlus',
       'Avalon

array([nan, '0', '0.2', '0.3', '0.5', '0.6', '0.7', '0.9', '1', '1.1',
       '1.2', '1.3', '1.4', '1.5', '1.6', '1.7', '1.8', '1.9', '2', '2.1',
       '2.2', '2.3', '2.4', '2.5', '2.6', '2.7', '2.8', '2.9', '3', '3.1',
       '3.2', '3.3', '3.4', '3.5', '3.6', '3.7', '3.8', '3.9', '4', '4.1',
       '4.2', '4.3', '4.4', '4.5', '4.6', '4.7', '4.8', '4.9', '5', '5.1',
       '5.2', '5.3', '5.4', '5.5', '5.6', '5.7', '5.8', '5.9', '6', '6.1',
       '6.2', '6.3', '6.4', '6.5', '6.6', '6.7', '6.8', '6.9', '7', '7.1',
       '7.2', '7.3', '7.4', '7.5', '7.6', '7.7', '7.8', '7.9', '8', '8.1',
       '8.2', '8.3', '8.4', '8.5', '8.6', '8.7', '8.8', '8.9', '9', '9.1',
       '9.2', '9.3', '9.4', '9.5', '9.6', '9.7'], dtype=object)

array([nan, '10tacle Studios', '10tacle Studios, Fusionsphere Systems',
       ..., 'puzzle.tv', 'syn Sophia', 'zSlide'], dtype=object)

array([nan, 'AO', 'E', 'E10+', 'EC', 'K-A', 'M', 'RP', 'T'], dtype=object)

# TODO
* Create true/false indicator columns for each genre
* Normalize the scores
* Simplify the publishers
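
The three items above could be approached roughly as follows. This is only a sketch on toy rows that are not taken from the dataset. The min-max scaling, the "Other" bucket for rare publishers and the `min_games` threshold are assumptions about what "normalize" and "simplify" might mean here, not decisions made in this notebook.

```python
import pandas as pd

# Toy rows shaped like dataset 1; the values are illustrative only
games = pd.DataFrame({
    "Genre": ["Sports", "Racing", "Sports", "Platform"],
    "Publisher": ["Nintendo", "Nintendo", "Nintendo", "Atari"],
    "Critic_Score": [76.0, 82.0, 80.0, None],
})

# 1. One true/false column per genre
genre_flags = pd.get_dummies(games["Genre"], prefix="Genre").astype(bool)

# 2. Min-max scaling of the score to [0, 1]; NaN stays NaN
score = games["Critic_Score"]
games["Critic_Score_Norm"] = (score - score.min()) / (score.max() - score.min())

# 3. Collapse publishers with fewer than min_games titles into "Other"
min_games = 2  # hypothetical threshold
counts = games["Publisher"].value_counts()
rare = counts[counts < min_games].index
games["Publisher_Simple"] = games["Publisher"].where(
    ~games["Publisher"].isin(rare), "Other"
)

games = pd.concat([games, genre_flags], axis=1)
print(games)
```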