# Exploratory Data Analysis of VGChartz.com Video Game Sales Data

## Features to look into:
- Proximity of release date to a holiday
- Day of week of release date
- Price of console
- Price of game
- Whether the developer is in the top 5 or not
- Location of developer
- Number of years since the game was released

## Things to note/do
- Make sure to remove data for games that were bundled with a console, e.g. Wii Sports
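
The release-date features in the list above (day of week, proximity to a holiday, years since release) could be derived along these lines. This is a minimal sketch, assuming release dates arrive as strings such as `19th Nov 06` and assuming Christmas Day as the holiday of interest. The helper names `parse_release_date`, `days_to_nearest_holiday`, and `years_since_release` are hypothetical and not part of the original notebook.

```python
import re
from datetime import date, datetime


def parse_release_date(text):
    """Parse strings like '19th Nov 06' into a date; return None if unparseable.

    Assumes English month abbreviations. Two-digit years follow strptime's
    rule: 69-99 map to 1969-1999 and 00-68 map to 2000-2068.
    """
    if not text:
        return None
    # Drop the ordinal suffix (st, nd, rd, th) that follows the day number.
    cleaned = re.sub(r'(?<=\d)(st|nd|rd|th)\b', '', text.strip())
    try:
        return datetime.strptime(cleaned, '%d %b %y').date()
    except ValueError:
        return None


def days_to_nearest_holiday(d, month=12, day=25):
    """Days between d and the closest occurrence of the holiday.

    The holiday defaults to 25 December, which is an assumption.
    """
    candidates = [date(d.year + offset, month, day) for offset in (-1, 0, 1)]
    return min(abs((d - holiday).days) for holiday in candidates)


def years_since_release(d, today=None):
    """Approximate number of years elapsed since the release date."""
    today = today or date.today()
    return (today - d).days / 365.25


released = parse_release_date('19th Nov 06')
day_of_week = released.weekday()  # Monday is 0, Sunday is 6
holiday_gap = days_to_nearest_holiday(released)
```

Each helper works on a single value, so it could be applied to a DataFrame column with `Series.map`, for example `df['release_date'].map(parse_release_date)`.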

In [1]:
# if needed: pip install requests or conda install requests
import requests
from bs4 import BeautifulSoup
import re
import lxml.html as lh
import pandas as pd

In [14]:
# Request one page of search results from VGChartz.com
url = 'http://www.vgchartz.com/games/games.php?page=1&results=10000&name=&console=&keyword=&publisher=&genre=&order=Sales&ownership=Both&boxart=Both&banner=Both&showdeleted=&region=All&goty_year=&developer=&direction=DESC&showtotalsales=1&shownasales=1&showpalsales=1&showjapansales=1&showothersales=1&showpublisher=1&showdeveloper=1&showreleasedate=1&showlastupdate=1&showvgchartzscore=1&showcriticscore=1&showuserscore=1&showshipped=1&alphasort=&showmultiplat=No'

response = requests.get(url)

# Check the status
response.status_code  # status code 200 => OK

200
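
The long query string in the request above is easier to read and change when it is assembled from a dictionary. The following is a minimal sketch using only the standard library; `build_search_url` is a hypothetical helper, and whether the site serves further pages through the `page` parameter is an assumption that has not been checked here.

```python
from urllib.parse import urlencode

BASE_URL = 'http://www.vgchartz.com/games/games.php'


def build_search_url(page=1, results=10000):
    """Rebuild the search URL used in this notebook from its parameters."""
    params = {
        'page': page, 'results': results,
        # Search filters left blank, exactly as in the original URL.
        'name': '', 'console': '', 'keyword': '', 'publisher': '', 'genre': '',
        'order': 'Sales', 'ownership': 'Both', 'boxart': 'Both', 'banner': 'Both',
        'showdeleted': '', 'region': 'All', 'goty_year': '', 'developer': '',
        'direction': 'DESC',
    }
    # Columns the results table should display.
    show_columns = [
        'totalsales', 'nasales', 'palsales', 'japansales', 'othersales',
        'publisher', 'developer', 'releasedate', 'lastupdate',
        'vgchartzscore', 'criticscore', 'userscore', 'shipped',
    ]
    params.update({'show' + name: 1 for name in show_columns})
    params.update({'alphasort': '', 'showmultiplat': 'No'})
    return BASE_URL + '?' + urlencode(params)


url = build_search_url(page=1)
```

The result can be passed straight to `requests.get(url, timeout=30)`, and calling `response.raise_for_status()` afterwards would turn a non-2xx status into an exception instead of relying on a manual check.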

In [15]:
# Parse the response twice: BeautifulSoup (soup) for tag searches, lxml (doc) for XPath queries
page = response.text
soup = BeautifulSoup(page, "lxml")
doc = lh.fromstring(response.content)

In [16]:
# Select the table that holds all the data, using XPath
tr_elements = doc.xpath('//*[@id="generalBody"]/table')[0]


In [17]:
# Check where the table begins: the header row appears to be at index 2, with the 17-cell data rows from index 3
[len(T) for T in tr_elements[:15]]

[2, 1, 16, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17]

In [18]:
# Verify we selected the table with the correct number of rows
len(tr_elements.xpath("./tr"))

10003

In [19]:
# Collect the game names from the link text in each cell
names_list = []
for row in tr_elements.xpath('.//tr'):
    for td in row.xpath('.//td'):
        link = td.find('a')
        if link is not None:
            names_list.append(link.text.strip())

In [20]:
# Parse the text cells (no images or URLs) of the data table into a pandas DataFrame
row_dict = {}
for counter, row in enumerate(tr_elements.xpath(".//tr")):
    # Skip the first three rows (indices 0-2), which precede the data
    if counter > 2:
        row_dict[counter] = [td.text for td in row.xpath(".//td")]

df = pd.DataFrame.from_dict(row_dict).transpose()
df.columns = ['position', 'game', 'blank', 'console', 'publisher', 'developer',
              'vgchart_score', 'critic_score', 'user_score', 'total_shipped',
              'total_sales', 'na_sales', 'pal_sales', 'japan_sales', 'other_sales',
              'release_date', 'last_update']

In [21]:
# Console names are shown as images, so record the alt text of each console image
consoles = []
for img in soup.find_all('img'):
    if 'images/consoles' in img['src']:
        consoles.append(img['alt'])

In [22]:
# Replace the console and game columns with the scraped values
df = df.reset_index(drop=True)
df['console'] = consoles
df['game'] = names_list
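
The column assignments above rely on the scraped lists having exactly one entry per table row. pandas raises its own error when a list has the wrong length, but a small guard run before the assignment can report which list is off. This is a sketch only; `check_same_length` is a hypothetical helper, and equal lengths do not by themselves prove that the entries line up row by row.

```python
def check_same_length(expected, **columns):
    """Raise ValueError naming every list whose length differs from `expected`."""
    mismatched = {
        name: len(values)
        for name, values in columns.items()
        if len(values) != expected
    }
    if mismatched:
        raise ValueError(f'expected {expected} values per column, got {mismatched}')


# Small stand-in lists; in the notebook these would be `consoles` and `names_list`.
check_same_length(2, console=['Wii', 'NES'], game=['Wii Sports', 'Super Mario Bros.'])
```

In the notebook the call would be `check_same_length(len(df), console=consoles, game=names_list)`.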

In [23]:
# Verify that correct data is in table
df.head()

| | position | game | blank | console | publisher | developer | vgchart_score | critic_score | user_score | total_shipped | total_sales | na_sales | pal_sales | japan_sales | other_sales | release_date | last_update |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Wii Sports | | Wii | Nintendo | Nintendo EAD | | 7.7 | | 82.86m | | | | | | 19th Nov 06 | |
| 1 | 2 | Super Mario Bros. | | NES | Nintendo | Nintendo EAD | | 10.0 | 8.2 | 40.24m | | | | | | 18th Oct 85 | |
| 2 | 3 | Mario Kart Wii | | Wii | Nintendo | Nintendo EAD | 8.7 | 8.2 | 9.1 | 37.14m | | | | | | 27th Apr 08 | 11th Apr 18 |
| 3 | 4 | PlayerUnknown's Battlegrounds | | PC | PUBG Corporation | PUBG Corporation | | | | 36.60m | | | | | | 21st Dec 17 | 13th Nov 18 |
| 4 | 5 | Wii Sports Resort | | Wii | Nintendo | Nintendo EAD | 8.8 | 8.0 | 8.8 | 33.09m | | | | | | 26th Jul 09 | |
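
Wii Sports sits at the top of the table, which is the case the note at the start of this notebook flags: games that shipped with a console should be removed. A minimal sketch of that filter follows, on plain dictionaries so it runs on its own. `BUNDLED_GAMES` and `drop_bundled` are hypothetical names, and the exclusion set holds only the one example named in the notes; any other bundled titles would have to be identified separately.

```python
# Hypothetical exclusion list: only the example named in the notes.
BUNDLED_GAMES = {('Wii Sports', 'Wii')}


def drop_bundled(rows, bundled=BUNDLED_GAMES):
    """Return the rows whose (game, console) pair is not in the bundled set."""
    return [row for row in rows if (row['game'], row['console']) not in bundled]


rows = [
    {'game': 'Wii Sports', 'console': 'Wii'},
    {'game': 'Wii Sports Resort', 'console': 'Wii'},
    {'game': 'Super Mario Bros.', 'console': 'NES'},
]
kept = drop_bundled(rows)
```

Matching on the exact (game, console) pair keeps Wii Sports Resort in the data. On the DataFrame the same rule could be written as a boolean mask, for example `df[~((df['game'] == 'Wii Sports') & (df['console'] == 'Wii'))]`.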


In [24]:
df.shape

(10000, 17)

In [25]:
df.tail()

| | position | game | blank | console | publisher | developer | vgchart_score | critic_score | user_score | total_shipped | total_sales | na_sales | pal_sales | japan_sales | other_sales | release_date | last_update |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9995 | 9996 | Black & Bruised | | PS2 | Majesco | Digital Fiction | | | | | 0.16m | 0.08m | 0.06m | | 0.02m | 26th Jan 03 | |
| 9996 | 9997 | Naruto: Ultimate Ninja Heroes 2 - The Phantom ... | | PSP | Namco Bandai | CyberConnect2 | | 6.2 | | | 0.16m | 0.14m | 0.00m | | 0.01m | 24th Jun 08 | |
| 9997 | 9998 | Take A Break's: Puzzle Master | | DS | Ubisoft | Ubisoft | | | | | 0.16m | | 0.15m | | 0.01m | 13th Nov 09 | |
| 9998 | 9999 | Spyro: Shadow Legacy | | DS | Vivendi Games | Amaze Entertainment | | | | | 0.16m | 0.14m | 0.01m | | 0.01m | 18th Oct 05 | |
| 9999 | 10000 | Blitz: The League | | X360 | Midway Games | Midway Games | | 6.5 | | | 0.16m | 0.15m | 0.00m | | 0.01m | 13th Nov 06 | |
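
The sales and shipment columns are still strings such as `82.86m` and `0.16m`, with blanks where no figure was listed. Before any numeric analysis they would need converting; the sketch below shows one way, assuming the trailing `m` means millions of units. `parse_millions` is a hypothetical helper name.

```python
def parse_millions(text):
    """Convert strings like '82.86m' to a float number of millions.

    Returns None for missing or unrecognised values. Reading the trailing
    'm' as "millions of units" is an assumption.
    """
    if text is None:
        return None
    cleaned = text.strip().lower()
    if not cleaned.endswith('m'):
        return None
    try:
        return float(cleaned[:-1])
    except ValueError:
        return None


shipped = parse_millions('82.86m')
missing = parse_millions(None)
```

Applied to the DataFrame, this could look like `df['total_sales'].map(parse_millions)`, repeated for each of the sales columns and for `total_shipped`.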
