# Elementary Statistical Analysis and Visualization of Video Game Data

### In this notebook I aim to gain a better understanding of visual analysis in Python by cutting my teeth on video game data scraped from the web. First I will scrape data from an online video games database. I will then munge and parse the data for use, and finally visualize several aspects of the data using Python code in order to draw a variety of conclusions. While the conclusions I reach from the data are not the overall goal of this project, I hope to come away with at least a few worthwhile ideas about games I may want to put on my "to-play" list.

## Section 0: Resources and Acknowledgements
### I will be consulting a number of websites and textbooks for this project. I will list them below as I use them, both as credit where it's due and to give anyone who may read this a good list of resources should they embark on a similar learning endeavor:

1: As always when I need a git refresher: Pro Git by Scott Chacon and Ben Straub - available at https://git-scm.com/book/en/v2

2: Twitch IGDB API Documentation - available at https://api-docs.igdb.com/#about

  2a: Twitch IGDB API Python Documentation - available at https://github.com/twitchtv/igdb-api-python


## Section 1: Scraping the Data

### I will be using the IGDB.com games database to scrape data for these analyses. The documentation for the IGDB API can be found in Section 0.

In [39]:
#typically I would list all of my imports in one section, but as this is largely a learning exercise I will import packages in the sections at which they become relevant.
import numpy as np 
import pandas as pd 
from igdb.wrapper import IGDBWrapper
import requests
import json


# read credentials from auth.json in the project directory; this is user info and is therefore kept private (auth.json is in .gitignore)
with open('auth.json') as f:
    user_info = json.load(f)
user_info['grant_type'] = 'client_credentials'


# request an app access token from Twitch, then build the IGDB wrapper with it
r = requests.post('https://id.twitch.tv/oauth2/token', params=user_info)
token_info = r.json()
access_token = token_info['access_token']
expires_in = token_info['expires_in']
wrapper = IGDBWrapper(user_info['client_id'], access_token)
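
The token response carries an `expires_in` value that the cell above stores but never uses. As a minimal sketch of how it could be used, the helpers below turn a response dictionary into an absolute expiry time and check whether the token is still usable. The names `parse_token_response` and `token_is_fresh`, the 60-second safety margin, and the sample payload are all hypothetical and not part of the Twitch or IGDB libraries; the sketch also assumes `expires_in` is a number of seconds.

```python
import time


def parse_token_response(payload, now=None):
    """Return (access_token, expires_at) from a token response dict.

    Assumes the payload has 'access_token' and 'expires_in' keys,
    the same two keys read in the notebook cell, and that
    'expires_in' is measured in seconds.
    """
    issued_at = time.time() if now is None else now
    return payload['access_token'], issued_at + payload['expires_in']


def token_is_fresh(expires_at, now=None, margin=60):
    """True while the token has more than `margin` seconds left."""
    current = time.time() if now is None else now
    return current < expires_at - margin


# Made-up payload for illustration only; a real one comes from r.json().
token, expires_at = parse_token_response(
    {'access_token': 'example-token', 'expires_in': 3600}, now=1000.0
)
print(token, expires_at, token_is_fresh(expires_at, now=2000.0))
```

In a long-running session, a check like `token_is_fresh(expires_at)` before each batch of queries would show when it is time to request a new token.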

In [40]:
# scrape platform data


platforms = wrapper.api_request(
    'platforms',
    'fields *; limit 500;'

)

platforms = platforms.decode("utf-8")
platforms_json = json.loads(platforms)

platforms_df = pd.DataFrame(platforms_json)

platforms_df.to_csv('./data/platforms.csv')





In [63]:
genres = wrapper.api_request(
    'genres',
    'fields *; limit 100;'
)

genres_json = genres.decode("utf-8")
genres_df = pd.DataFrame(json.loads(genres_json))

genres_df.to_csv('./data/genres.csv')
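
The platform and genre cells repeat the same steps: decode the response bytes, parse the JSON, and build a DataFrame. A small helper could hold the first two steps in one place. This is only a sketch: `decode_records` is a hypothetical name, the sample bytes are made up, and it assumes every response is a UTF-8 encoded JSON array, as the responses in this notebook appear to be.

```python
import json


def decode_records(raw):
    """Decode raw response bytes (a UTF-8 JSON array) into a list of dicts."""
    records = json.loads(raw.decode("utf-8"))
    if not isinstance(records, list):
        # Anything other than an array is not the row-per-record shape we expect.
        raise ValueError("expected a JSON array of records")
    return records


# Made-up sample bytes standing in for a real API response.
sample = b'[{"id": 1, "name": "Example A"}, {"id": 2, "name": "Example B"}]'
records = decode_records(sample)
print(len(records), records[0]["name"])
```

With the helper in place, each cell would reduce to something like `pd.DataFrame(decode_records(genres))` followed by `to_csv`.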

In [42]:
#test igdb query limits

games = wrapper.api_request(
    'games',
    'fields *; limit 500;'
)

games_json = games.decode("utf-8")
games_df = pd.DataFrame(json.loads(games_json))

print(games_df.shape)

(500, 47)
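
The test above returns 500 rows for `limit 500`. If more rows than that are ever needed, the requests would have to be split into pages. The sketch below only builds the query strings; it assumes that 500 is the largest page the API will return and that the query language accepts an `offset` clause next to `limit`, both of which should be checked against the API documentation listed in Section 0. `paged_queries` is a hypothetical helper name.

```python
def paged_queries(base_query, total, page_size=500):
    """Yield query strings that cover `total` rows in pages of `page_size`.

    Assumes the API supports 'limit N; offset M;' clauses.
    """
    for offset in range(0, total, page_size):
        # The last page may be smaller than page_size.
        limit = min(page_size, total - offset)
        yield f"{base_query} limit {limit}; offset {offset};"


for q in paged_queries("fields *;", 1200):
    print(q)
```

Each string could then be passed as the second argument to `wrapper.api_request('games', q)`.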


The IGDB API expects dates in queries to be Unix timestamps, which the code in this notebook computes in seconds. For the purposes of this exploration, I have decided to query the 500 top-rated games for each year based on IGDB user ratings. In the following code blocks, I'll define those date ranges, convert them to Unix timestamps, and run the relevant queries for each year.
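
As a small illustration of the date handling, here is one way the yearly windows and query strings could be built. It is a sketch under stated assumptions rather than the approach the notebook itself takes: it uses UTC instead of the machine's local time zone, it treats a year as running from January 1 up to (but not including) January 1 of the next year, and the helper names `year_window_utc` and `build_year_query` are hypothetical.

```python
import datetime


def year_window_utc(year):
    """Return (begin, end) Unix timestamps in seconds for a calendar year in UTC."""
    begin = datetime.datetime(year, 1, 1, tzinfo=datetime.timezone.utc)
    end = datetime.datetime(year + 1, 1, 1, tzinfo=datetime.timezone.utc)
    return int(begin.timestamp()), int(end.timestamp())


def build_year_query(year, limit=500):
    """Build a query string for the top-rated games first released in `year`."""
    begin, end = year_window_utc(year)
    return (
        f"fields *; where first_release_date >= {begin} "
        f"& first_release_date < {end}; sort rating desc; limit {limit};"
    )


print(year_window_utc(2000))
print(build_year_query(2000))
```

Using UTC keeps the window boundaries the same no matter where the notebook is run, whereas `time.mktime` interprets the date in the local time zone.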

In [74]:
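import time, datetime

startYear = 2000
endYear = 2020

m = 1
d = 1

print(time.mktime(datetime.datetime(1985, 1, 1).timetuple()))

# collect one DataFrame per year and concatenate once at the end
# (DataFrame.append was removed in pandas 2.0)
year_frames = []

for y in range(startYear, endYear+1):

    # time.mktime interprets these dates in the local time zone and returns seconds
    year_begin = int(time.mktime(datetime.datetime(y, m, d).timetuple()))
    # NOTE: this end date is January 31, so the window only covers January of each year;
    # datetime.datetime(y + 1, 1, 1) would cover the full year
    year_end = int(time.mktime(datetime.datetime(y, 1, 31).timetuple()))

    print(year_begin, year_end)

    # limit 500 requests the 500 top-rated games in the window
    q = 'fields *; where first_release_date > ' + str(year_begin) + ' & first_release_date < ' + str(year_end) + '; sort rating desc; limit 500;'

    games = wrapper.api_request(
        'games',
        q
    )

    # parse this year's response
    year_games_json = games.decode("utf-8")
    year_games_df = pd.DataFrame(json.loads(year_games_json))

    year_frames.append(year_games_df)

games_df = pd.concat(year_frames)

games_df.to_csv('./data/games.csv')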
    



473407200.0
946706400 949298400
978328800 980920800
1009864800 1012456800
1041400800 1043992800
1072936800 1075528800
1104559200 1107151200
1136095200 1138687200
1167631200 1170223200
1199167200 1201759200
1230789600 1233381600
1262325600 1264917600
1293861600 1296453600
1325397600 1327989600
1357020000 1359612000
1388556000 1391148000
1420092000 1422684000
1451628000 1454220000
1483250400 1485842400
1514786400 1517378400
1546322400 1548914400
1577858400 1580450400


In [66]:
games = wrapper.api_request(
    'games',
    'fields *; where first_release_date < 1538129354;'
)

print(games)

b'[\n  {\n    "id": 70,\n    "aggregated_rating": 92.0,\n    "aggregated_rating_count": 1,\n    "category": 0,\n    "cover": 71,\n    "created_at": 1297900800,\n    "external_games": [\n      8545,\n      78221,\n      154955,\n      1193716\n    ],\n    "first_release_date": 825984000,\n    "follows": 2,\n    "game_modes": [\n      1\n    ],\n    "genres": [\n      5,\n      13,\n      15\n    ],\n    "involved_companies": [\n      709\n    ],\n    "keywords": [\n      453,\n      970,\n      1025,\n      1669,\n      3782,\n      6351\n    ],\n    "name": "Terra Nova: Strike Force Centauri",\n    "platforms": [\n      13\n    ],\n    "player_perspectives": [\n      1\n    ],\n    "rating": 70.0,\n    "rating_count": 2,\n    "release_dates": [\n      544\n    ],\n    "screenshots": [\n      84914,\n      84915,\n      84916,\n      84917,\n      84918\n    ],\n    "similar_games": [\n      13200,\n      16806,\n      17130,\n      24620,\n      25311,\n      34823,\n      55038,\n    