## **Purpose**
In this notebook we scrape video game data from the VGChartz website (https://www.vgchartz.com/).
* We collect the following fields for each game:
    * Game name (name)
    * Number of copies shipped (total_shipped)
    * Developer name (developer)
    * Rank (rank)
    * Platform the game is on (platform)
    * Release date (release_date)
    * Publisher name (publisher)
    * Sales in North America (na_sales)
    * Sales in Europe (eu_sales)
    * Sales in Japan (jp_sales)
    * Sales in the rest of the world (other_sales)
    * Overall global sales (global_sales)
    * Genre of the game (game_genre)



In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
import csv
import os

To avoid scraping data that has already been collected, I first check whether the CSV file already exists. If it does, the data is loaded from disk and there is no need to scrape again. If it does not, the scraping cell below runs.

Check if the csv file already exists

In [8]:
if not os.path.exists("../data/raw/Original_VGChartz.csv"):
    print("Missing dataset file")
    exists=False
else:
    print("Success!")
    exists=True
    df= pd.read_csv("../data/raw/Original_VGChartz.csv", low_memory = False)

Success!
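
The same check can be wrapped in a small helper so that the flag and the DataFrame cannot get out of sync. This is only a sketch: `load_cached` is a hypothetical name, not something the notebook defines, and the demo uses a temporary file rather than the real dataset path.

```python
import os
import tempfile

import pandas as pd


def load_cached(path):
    """Return the cached DataFrame, or None if no file exists at `path`."""
    if not os.path.exists(path):
        return None
    return pd.read_csv(path, low_memory=False)


# Demo on a throwaway file: missing first, then present.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "cache.csv")
    before = load_cached(demo_path)  # file does not exist yet
    pd.DataFrame({"name": ["Example Game"]}).to_csv(demo_path, index=False)
    after = load_cached(demo_path)   # file now exists
```

With a helper like this, the scraping cell could be guarded by `if df is None:` instead of a separate boolean.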


I will now scrape the necessary data from VGChartz.

In [12]:
if not exists:
    pd.set_option('display.max_rows', 500)
    pd.set_option('display.max_columns', 500)
    pd.set_option('display.width', 1000)

    pages = 55
    genres=['Action', 'Action-Adventure', 'Adventure', 'Board+Game', 'Education', 'Fighting', 'Misc', 'MMO', 'Music', 'Party', 'Platform', 'Puzzle', 'Racing', 'Role-Playing', 'Sandbox', 'Shooter', 'Simulation', 'Sports', 'Strategy', 'Visual+Novel']
    rank = []
    gname = []
    platform = []
    release_date = []
    publisher = []
    sales_na = []
    sales_eu = []
    sales_jp = []
    sales_ot = []
    sales_gl = []
    developer= []
    total_shipped=[]
    game_url=[]
    game_genre=[]
    game_url_string=[]
    count=0
    fails=0

    urlhead = 'http://www.vgchartz.com/games/games.php?page='
    urlmid = '&results=200&name=&console=&keyword=&publisher=&genre='
    urltail = '&order=Sales&ownership=Both&boxart=Both&banner=Both&showdeleted=&region=All&goty_year=&developer=&direction=DESC&showtotalsales=1&shownasales=1&showpalsales=1&showjapansales=1&showothersales=1&showpublisher=1&showdeveloper=1&showreleasedate=1&showlastupdate=1&showvgchartzscore=1&showcriticscore=1&showuserscore=1&showshipped=1&alphasort=&showmultiplat=No'

    for genre in genres:
        for page in range(1,pages):  # range() stops at pages-1, so pages 1..54 are requested
            surl = urlhead + str(page) +urlmid +genre + urltail    
            r = requests.get(surl)
            r = r.text
            soup = BeautifulSoup(r,'html.parser')
            for row in soup.find_all('tr'):
                try:
                    col=row.find_all('td')
                    col_0=col[0].text
                    col_4=col[4].text
                    col_5=col[5].text
                    col_9=col[6].text
                    col_10=col[10].text
                    col_11=col[11].text
                    col_12=col[12].text
                    col_13=col[13].text
                    col_14=col[14].text
                    col_15=col[15].text
                    img = col[3].find('img')
                    col_3=img['alt']
                    a_tag=col[2].find('a')
                    url_col=a_tag['href']
                    col_2=(a_tag.text)
                    url_string=url_col.rsplit('/', 2)[1]

                    if len(col_0)<6:
                        rank.append(col_0)
                        gname.append(col_2)
                        publisher.append(col_4)
                        developer.append(col_5)
                        total_shipped.append(col_9)
                        sales_gl.append(col_10)
                        sales_na.append(col_11)
                        sales_eu.append(col_12)
                        sales_jp.append(col_13)
                        sales_ot.append(col_14)
                        release_date.append(col_15)
                        platform.append(col_3)
                        game_url.append(url_col)
                        game_genre.append(genre)
                        game_url_string.append(url_string)
                        count+=1
                except Exception:  # rows without the expected cells (e.g. header rows) are counted as fails
                    fails+=1
                    continue
    print('vg_chartz count = '+str(count))
    print('vg_chartz fails = '+str(fails))
        
    columns = {'total_shipped' : total_shipped,
           'developer' : developer,
           'rank': rank,
           'name': gname,
           'platform': platform,
           'release_date': release_date,
           'publisher': publisher,
           'na_sales':sales_na,
           'eu_sales': sales_eu,
           'jp_sales': sales_jp,
           'other_sales':sales_ot,
           'global_sales':sales_gl,
           'game_genre':game_genre,
           'game_url':game_url,
           'game_url_string':game_url_string}

    df = pd.DataFrame(columns)
    df = df[['total_shipped','developer','rank','name','platform','release_date','publisher','na_sales','eu_sales','jp_sales','other_sales','global_sales','game_genre','game_url','game_url_string']]
    df.to_csv("VGChartz.csv",sep=",",encoding='utf-8')
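
The URL concatenation and the `rsplit` call in the cell above can also be sketched with the standard library. The helper names `build_search_url` and `slug_from_game_url` are hypothetical, the parameter set is a trimmed subset of the original query string (whether VGChartz returns the same columns with fewer parameters is an assumption), and the example game URL is made up for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.vgchartz.com/games/games.php"


def build_search_url(page, genre, results=200):
    """Build a listing URL; urlencode turns spaces into '+' (e.g. 'Board+Game')."""
    params = {
        "page": page,
        "results": results,
        "genre": genre,
        "order": "Sales",
        "showtotalsales": 1,
        "showshipped": 1,
    }
    return BASE_URL + "?" + urlencode(params)


def slug_from_game_url(game_url):
    """Return the last path segment, with or without a trailing slash."""
    return game_url.rstrip("/").rsplit("/", 1)[-1]


example_url = build_search_url(1, "Board Game")
# Hypothetical game URL, used only to illustrate the slug extraction.
example_slug = slug_from_game_url("https://www.vgchartz.com/game/12345/example-game/")
```

Unlike `url_col.rsplit('/', 2)[1]`, which relies on the link ending in a slash, stripping the slash first gives the same slug either way.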


The dictionary below maps VGChartz platform names to MetaCritic platform names. VGChartz has sales data for nearly every platform, including older ones (NES, Atari, etc.). MetaCritic does not track some of those older platforms, so the MetaCritic data for their games would be blank. I therefore include only games on platforms that MetaCritic tracks.

In [13]:
#Rewording platforms from vgchartz wording to metacritic wording for use in url
platform_rewording_dict = {'PS3': 'playstation-3',
                           'X360': 'xbox-360',
                           'PC': 'pc',
                           'WiiU': 'wii-u',
                           '3DS': '3ds',
                           'PSV': 'playstation-vita',
                           'iOS': 'ios',
                           'Wii': 'wii',
                           'DS': 'ds',
                           'PSP': 'psp',
                           'PS2': 'playstation-2',
                           'PS': 'playstation',
                           'XB': 'xbox', 
                           'GC': 'gamecube',
                           'GBA': 'game-boy-advance',
                           'DC': 'dreamcast',
                           'PS4':'playstation-4',
                           'XOne':'xbox-one',
                           'NS':'switch'
                          }

I remove games from the data frame that aren't on these platforms.

This function turns the keys in the platform_rewording_dict into a list

In [14]:
def getList(mapping):
    # Return the dictionary's keys as a list.
    # The parameter is not called `dict`, so the built-in is not shadowed.
    return list(mapping.keys())

In [15]:
platform_list=getList(platform_rewording_dict) 

In [16]:
df1=df[df['platform'].isin(platform_list)]
len(df1.index)

39613

Many games have no value for global sales. These games aren't useful to us, so I drop them.

In [17]:
# Treat 'N/A' as missing, then keep only games that have a global sales value
df1 = df1.replace('N/A', np.nan)
df2 = df1.dropna(subset=['global_sales'])
len(df2.index)

18336
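
The two filtering steps above (keep tracked platforms, then drop rows without global sales) can be checked on a toy frame. This is a minimal sketch: the rows are invented, and it assumes the missing cells contain exactly the string `'N/A'`, as the `replace` call in the notebook implies.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for platform_list and the scraped DataFrame (invented values).
tracked = ["PS3", "PC"]
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "platform": ["PS3", "NES", "PC", "PC"],
    "global_sales": ["1.20m", "0.50m", "N/A", "0.03m"],
})

# Step 1: keep only rows whose platform is tracked.
on_tracked = toy[toy["platform"].isin(tracked)]

# Step 2: turn 'N/A' into NaN and drop rows with no global sales.
with_sales = on_tracked.replace("N/A", np.nan).dropna(subset=["global_sales"])

kept = with_sales["name"].tolist()
```

Here game B is removed by the platform filter and game C by the missing sales value.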

I create a list of MetaCritic URLs from each game's URL string and platform. These URLs can be used to scrape MetaCritic for user scores and further data.

In [18]:
# Build a list of MetaCritic URLs from each game's URL string and platform
meta_full_url_list=[]
# Reset to a 0..n-1 index so the positional loop below can use .loc
df2=df2.reset_index(drop=True)

for row in range(0,len(df2)):
    plat_temp=df2.loc[row,'platform']
    url_string_temp=df2.loc[row,'game_url_string']
    if plat_temp in platform_rewording_dict.keys():
        meta_url='https://www.metacritic.com/game/'+ platform_rewording_dict[plat_temp]+'/'+url_string_temp
        meta_full_url_list.append(meta_url)
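
The loop above can also be written as a vectorized expression. This is a sketch on a toy frame with made-up URL strings; it assumes every row's platform is already a key of the mapping (true here because of the earlier `isin` filter), and `platform_to_meta` is a shortened stand-in for `platform_rewording_dict`.

```python
import pandas as pd

# Shortened stand-in for platform_rewording_dict.
platform_to_meta = {"PC": "pc", "PS3": "playstation-3"}

toy = pd.DataFrame({
    "platform": ["PC", "PS3"],
    "game_url_string": ["example-game", "another-game"],  # hypothetical slugs
})

# Map each platform to its MetaCritic name, then join the URL parts row by row.
toy["meta_url"] = (
    "https://www.metacritic.com/game/"
    + toy["platform"].map(platform_to_meta)
    + "/"
    + toy["game_url_string"]
)

meta_urls = toy["meta_url"].tolist()
```

Because the result is a column aligned with the frame's index, there is one URL per row and no separate list to keep in step with the DataFrame.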

In [19]:
df2.sample()

Unnamed: 0.1,Unnamed: 0,total_shipped,developer,rank,name,platform,release_date,publisher,na_sales,eu_sales,jp_sales,other_sales,global_sales,game_genre,game_url,game_url_string
4146,10053,,Ubisoft Montreal,1330,Prince of Persia (2008),PC,09th Dec 08,Ubisoft,,0.02m,,0.00m,0.03m,Adventure,https://www.vgchartz.com/game/24854/prince-of-...,prince-of-persia-2008


I add the MetaCritic URL to the dataframe. This column will be used to combine the VGChartz data with the MetaCritic data.

In [60]:
column_values = pd.Series(meta_full_url_list)
df2.insert(loc=0, column='meta_url', value=column_values)
df2['name']=df2['name'].str.strip()
df2=df2.set_index('name')
df2.to_csv("100-Collecting_VGChartz.csv",sep=",",encoding='utf-8')